A Cost-Effective Entangling Prefetcher for Instructions
17 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design
  TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.
- ICP: Exploiting Instruction Correlation for Prefetching Irregular Memory Accesses
  ICP is a prefetcher that learns stable instruction correlations to speculatively compute future irregular memory accesses, outperforming Triangel by 14% and DMP by 6% with only 2.1 KB storage.
- Enhancing Instruction Prefetching via Cache and TLB Management
  IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.
- SPEC CPU: The Next Generation
  SPEC CPU 2026 presents a new benchmark suite using open-source apps, expanded multithreading, and Rolling-Round-Robin Rate to address gaps in evaluating heterogeneous multiprogrammed CPU performance.
- Mestra: Exploring Migration on Virtualized CGRAs
  Mestra adds multi-tenancy and live migration to CGRAs, cutting workload makespan by up to 70% and tail latency by up to 30% at 0.13% extra LUT cost per region.
- Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design
  FFM finds optimal fused mappings for tensor accelerators over 10,000 times faster than prior mappers while cutting energy-delay product by up to 1.8x versus hand-tuned designs.
- FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching
  FEATHER integrates data reordering into its reduction network via a new spatial array (Nest) and multi-stage network (BIRRD) to enable low-overhead dataflow switching in ML accelerators, delivering 1.27-2.89x latency speedup and 1.3-6.43x energy gains versus prior designs at 6% area overhead.
- Designing Datacenter Power Delivery Hierarchies for the AI Era
  Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.
- MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces
  Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.
- AME-PIM: Can Memory be Your Next Tensor Accelerator?
  The paper maps RISC-V AME matrix instructions to HBM-PIM micro-kernels via a PEP-based model and reduction-free outer-product dataflow, achieving up to 14.9 GFLOP/s on Samsung Aquabolt-XL.
- DCGen 1.1 Technical Report: Generating Datacenter Configurations (including IT, Power, Cooling)
  DCGen generates customizable datacenter configurations with IT, power, and cooling components optimized for power, compute, and area targets using real equipment catalogs and workload-specific IT mixes.
- DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
  DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing up to 4.5x performance gap versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.
- Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration
  Fine-grained fusion and adaptive scheduling in SSMs deliver up to 4.8x speedup and 10x lower on-chip memory, enabling a fusion-aware accelerator with 1.78x higher performance than MARCA at equal area.
- Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM
  Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.
- The Landscape of GPU-Centric Communication
  A survey categorizing vendor mechanisms and user-level libraries for GPU-centric communication within and across nodes, with discussion of benefits, challenges, and open questions.
- A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks
- The EDGE Language: Extended General Einsums for Graph Algorithms