Manning, Andrew Ng, and Christopher Potts
3 Pith papers cite this work. Polarity classification is still indexing.
citation roles: background (1)
citation polarities: background (1)
citing papers explorer
- Fast MoE Inference via Predictive Prefetching and Expert Replication
  Dynamic replication of predicted overloaded experts in MoE models achieves near-100% GPU utilization and up to 3x faster inference while retaining 90-95% of baseline performance.
- Kernel-Based ReLU Approximation for Homomorphic Encryption-Compatible Privacy-preserving Deep Learning Models
  Kernel-based ReLU is approximated by a quadratic polynomial for low-depth homomorphic encryption compatibility, trained on LLM token embeddings and evaluated across DL and transformer settings.
- Poodle: Seamlessly Scaling Down Large Language Models with Just-in-Time Model Replacement
  Poodle shows that LLMs can be automatically replaced with cheaper models for recurring tasks to save significant cost and energy without extra user effort.