KernelBench shows that even the best current LLMs generate correct and faster-than-baseline GPU kernels in fewer than 20 percent of realistic ML workloads.
Efficient transformers: A survey
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
representative citing papers
OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark scores by up to 3.58 points.
A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.
citing papers explorer
-
OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance
OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark scores by up to 3.58 points.