arXiv preprint arXiv:2501.13987

· 2025 · arXiv 2501.13987

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

representative citing papers

LoopQ: Quantization for Recursive Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity under W4A4 versus static baselines.

QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models

cs.LG · 2026-02-23 · unverdicted · novelty 7.0

QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.

Theory-optimal Quantization Based on Flatness

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on LLaMA-3-8B.

CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.

TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network

cs.LG · 2025-06-02 · unverdicted · novelty 6.0

TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.

BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook

cs.LG · 2025-05-24 · conditional · novelty 6.0

BTC-LLM uses a binary codebook for pattern clustering and a learnable transformation to achieve 0.7-1.11 bit LLM quantization while limiting accuracy loss to a few percent on LLaMA and Qwen models.

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

cs.LG · 2026-05-18 · unverdicted · novelty 5.0

GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.

citing papers explorer

Showing 8 of 8 citing papers.

LoopQ: Quantization for Recursive Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 14
LoopQ provides a loop-aware PTQ framework for recursive Transformers that mitigates distribution shift, state reuse, and recursive error accumulation, yielding 68.8% higher average accuracy and 87.7% lower perplexity under W4A4 versus static baselines.
QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models cs.LG · 2026-02-23 · unverdicted · none · ref 11
QuantVLA is the first post-training quantization framework for VLA models that quantizes the diffusion transformer action head and reports higher task success rates than full-precision baselines with roughly 70% memory savings on the quantized components.
Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 19
SplitQ improves low-bit PTQ for VLMs by isolating modality-specific outlier channels via MOCD and applying dual-branch adaptive calibration via ACC, outperforming prior methods on six datasets across W4A8 to W3A2 settings.
Theory-optimal Quantization Based on Flatness cs.LG · 2026-05-11 · unverdicted · none · ref 6
The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on LLaMA-3-8B.
CoQuant: Joint Weight-Activation Subspace Projection for Mixed-Precision LLMs cs.LG · 2026-04-29 · unverdicted · none · ref 6
CoQuant selects optimal high-precision subspaces for mixed-precision LLM quantization via a closed-form weighted PCA that balances weight and activation covariances derived from expected output error.
TAH-QUANT: Effective Activation Quantization in Pipeline Parallelism over Slow Network cs.LG · 2025-06-02 · unverdicted · none · ref 43
TAH-Quant introduces tile-wise adaptive Hadamard quantization for activations in pipeline parallelism, achieving 3-4 bit compression with up to 4.3x throughput speedup and O(1/sqrt(T)) convergence matching SGD.
BTC-LLM: Efficient Sub-1-Bit LLM Quantization via Learnable Transformation and Binary Codebook cs.LG · 2025-05-24 · conditional · none · ref 10
BTC-LLM uses a binary codebook for pattern clustering and a learnable transformation to achieve 0.7-1.11 bit LLM quantization while limiting accuracy loss to a few percent on LLaMA and Qwen models.
GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets cs.LG · 2026-05-18 · unverdicted · none · ref 49
GAMMA is a post-training framework that learns stable module sensitivity rankings for mixed-precision LLM quantization and projects them to exact bit budgets via integer programming, enabling reuse across arbitrary memory targets.

arXiv preprint arXiv:2501.13987

fields

years

verdicts

representative citing papers

citing papers explorer