Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
Advances in Neural Information Processing Systems , volume=
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
years
2026 4roles
background 1polarities
unclear 1representative citing papers
Standard attention collapses on additively mixed signals because it is memoryless with respect to explained evidence, but adding multiplicative depletion with an attention bias prevents collapse and enables multi-source inference.
MiniGPT is a self-contained PyTorch implementation of standard GPT autoregressive modeling that reaches 1.478 validation loss on Tiny Shakespeare with a 10.77M-parameter model and produces recognizable Shakespeare-style text.
citing papers explorer
-
Non-Monotonic Latency in Apple MPS Decoding: KV Cache Interactions and Execution Regimes
Apple MPS transformer decoding shows abrupt latency spikes up to 21x in narrow decoding-budget intervals due to KV cache and execution regime shifts, absent on CPU and CUDA.
-
When Attention Collapses: Residual Evidence Modeling for Compositional Inference
Standard attention collapses on additively mixed signals because it is memoryless with respect to explained evidence, but adding multiplicative depletion with an attention bias prevents collapse and enables multi-source inference.
-
MiniGPT: Rebuilding GPT from First Principles
MiniGPT is a self-contained PyTorch implementation of standard GPT autoregressive modeling that reaches 1.478 validation loss on Tiny Shakespeare with a 10.77M-parameter model and produces recognizable Shakespeare-style text.
- HyperBones: Realtime Bone-driven Neural Garment Simulation with Hypernetwork Conditioning