Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.
Title resolution pending
6 Pith papers cite this work. Polarity classification is still indexing.
years
2026 6representative citing papers
CachePrune enables fine-grained, token-level KV cache reuse across LLM requests by masking sensitive segments, eliminating direct side-channel leakage while cutting TTFT by 4.5x and raising hit rates by 44% versus prior coarse-grained methods.
FlashFPS accelerates FPS via candidate/iteration pruning and inter-layer caching, delivering 5.16x GPU speedup and 2.69x on accelerators with negligible accuracy loss.
MTServe achieves up to 3.1x speedup for generative recommendation model serving by using hierarchical caches with host RAM and system optimizations while keeping cache hit ratios above 98.5%.
TOPCELL reformulates standard cell topology optimization as an LLM generative task with GRPO fine-tuning, outperforming base models and matching exhaustive solvers with 85.91x speedup in 2nm/7nm industrial flows.
DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains over baselines.
citing papers explorer
-
FlashFPS: Efficient Farthest Point Sampling for Large-Scale Point Clouds via Pruning and Caching
FlashFPS accelerates FPS via candidate/iteration pruning and inter-layer caching, delivering 5.16x GPU speedup and 2.69x on accelerators with negligible accuracy loss.