Wafer-Scale AI: GPU Impossible Performance

IEEE Computer Society · 2024 · arXiv 1935.2024

3 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators

cs.AR · 2026-03-26 · conditional · novelty 6.0

FireBridge enables cycle-accurate hardware-firmware co-verification in standard simulators using randomized memory bridges, delivering up to 50x faster debug iterations than FPGA-based flows for accelerators such as systolic arrays and CGRAs.

M100: An Orchestrated Dataflow Architecture Powering General AI Computing

cs.LG · 2026-04-20 · unverdicted · novelty 5.0

M100 is a tensor-based dataflow architecture that eliminates heavy caching through compiler-managed data streams, claiming higher utilization and better performance than GPGPUs for AD and LLM inference tasks.

citing papers explorer

Showing 3 of 3 citing papers.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
cs.AI · 2025-11-05 · unverdicted · none · ref 18

FireBridge: Cycle-Accurate Hardware + Firmware Co-Verification for Modern Accelerators
cs.AR · 2026-03-26 · conditional · none · ref 33

M100: An Orchestrated Dataflow Architecture Powering General AI Computing
cs.LG · 2026-04-20 · unverdicted · none · ref 18
