A sample-free compilation framework for efficient dynamic tensor computation
2 Pith papers cite this work. Polarity classification is still indexing.
Pith papers citing it: 2
Fields: cs.DC (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers:
GICC enables GPU-initiated NIC coordination with asynchronous resource reclamation, delivering up to 229x lower latency and 25% better weak scaling on Slingshot versus prior runtimes.
citing papers explorer
- FalconGEMM: Surpassing Hardware Peaks with Lower-Complexity Matrix Multiplication
  FalconGEMM delivers a framework with deployment, group-parallel execution, and analytical decision modules that makes lower-complexity matrix multiplication practical, beating cuBLAS and similar libraries by 7.59-17.85% on LLM tasks.
- GICC: A High-Performance Runtime for GPU-Initiated Communication and Coordination in Modern HPC Systems
  GICC enables GPU-initiated NIC coordination with asynchronous resource reclamation, delivering up to 229x lower latency and 25% better weak scaling on Slingshot versus prior runtimes.