GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.
Dynamo: Runtime switchable quantization for moe with cross-dataset adaptation.arXiv preprint arXiv:2503.21135
3 Pith papers cite this work. Polarity classification is still indexing.
years
2026 3verdicts
UNVERDICTED 3representative citing papers
GSQ uses Gumbel-Softmax to optimize scalar quantization grids for LLMs, closing most of the accuracy gap to vector methods like QTIP at 2-3 bits per parameter while using symmetric scalar grids compatible with existing kernels.
DynaGraph is a multi-model framework that multiplexes PEFT adapters on a shared base model with evaluator-driven dynamic topology reconfiguration and hierarchical self-healing to achieve near-72B performance on reasoning benchmarks using an 8B model while reducing latency and tokens.
citing papers explorer
-
GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs
GEMQ applies global LP-based expert importance estimation and router fine-tuning within progressive quantization to cut memory and speed inference in MoE LLMs with little accuracy loss.