SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
Wafer-Scale AI: GPU Impossible Performance
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
FireBridge enables cycle-accurate hardware-firmware co-verification in standard simulators using randomized memory bridges, delivering up to 50x faster debug iterations than FPGA-based flows for accelerators such as systolic arrays and CGRAs.
M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.
citing papers explorer
-
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
-
FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators
FireBridge enables cycle-accurate hardware-firmware co-verification in standard simulators using randomized memory bridges, delivering up to 50x faster debug iterations than FPGA-based flows for accelerators such as systolic arrays and CGRAs.
-
M100: An Orchestrated Dataflow Architecture Powering General AI Computing
M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.