SegFold achieves 1.95× geometric-mean speedup over prior SpGEMM accelerators via fine-grained dynamic scheduling and remapping in its Segment dataflow.
In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (Vancouver, BC, Canada) (ASPLOS 2023), Tor M
10 Pith papers cite this work.
representative citing papers
A new partitioning algorithm that provably load-balances arbitrary sparse tensor algebra expressions by generalizing parallel merging to multi-operand, multi-dimensional hierarchical structures, implemented in a compiler framework.
Proposes AROM to shift LtRAM management to the OS by making pages read-only to applications, using CoW faults for writes to simplify DIMM hardware.
CXL-ClusterSim is a full-system simulation framework combining gem5 and SST to model CXL disaggregated memory for pooling and sharing.
Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.
CCCL uses CXL memory pooling to deliver cross-node GPU collectives 1.34-1.94x faster than 200 Gbps InfiniBand RDMA, with a 1.11x LLM training speedup and a 2.75x hardware cost reduction.
Equilibria delivers per-container fairness controls and observability for CXL memory tiering, improving production workload performance by up to 52% over Linux TPP while suppressing noisy-neighbor interference.
PRISM introduces a probabilistic performance modeling framework that quantifies guarantees on training time for large-scale distributed systems under runtime variability.