
A Cost-Effective Entangling Prefetcher for Instructions

Qijing Huang, Aravind Kalaiah, Minwoo Kang, James Demmel, Grace Dinh, John Wawrzynek, Thomas Norell, Yakun Sophia Shao · 2021 · arXiv 2012.2021

17 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design

cs.AR · 2026-02-16 · unverdicted · novelty 8.0 · 3 refs

TCM finds provably optimal DNN accelerator mappings by pruning the search space up to 32 orders of magnitude with a new dataplacement concept, delivering 1.2-6.5x better energy-delay-product in 17 seconds instead of hours.

ICP: Exploiting Instruction Correlation for Prefetching Irregular Memory Accesses

cs.AR · 2026-05-15 · unverdicted · novelty 7.0

ICP is a prefetcher that learns stable instruction correlations to speculatively compute future irregular memory accesses, outperforming Triangel by 14% and DMP by 6% with only 2.1 KB storage.

Enhancing Instruction Prefetching via Cache and TLB Management

cs.AR · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.

SPEC CPU: The Next Generation

cs.PF · 2026-05-02 · unverdicted · novelty 7.0

SPEC CPU 2026 presents a new benchmark suite using open-source apps, expanded multithreading, and Rolling-Round-Robin Rate to address gaps in evaluating heterogeneous multiprogrammed CPU performance.

Mestra: Exploring Migration on Virtualized CGRAs

cs.AR · 2026-04-06 · unverdicted · novelty 7.0

Mestra adds multi-tenancy and live migration to CGRAs, cutting workload makespan by up to 70% and tail latency by up to 30% at 0.13% extra LUT cost per region.

Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design

cs.AR · 2026-02-16 · unverdicted · novelty 7.0 · 3 refs

FFM finds optimal fused mappings for tensor accelerators over 10,000 times faster than prior mappers while cutting energy-delay product by up to 1.8x versus hand-tuned designs.

FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching

cs.AR · 2024-05-21 · unverdicted · novelty 7.0

FEATHER integrates data reordering into its reduction network via a new spatial array (Nest) and multi-stage network (BIRRD) to enable low-overhead dataflow switching in ML accelerators, delivering 1.27-2.89x latency speedup and 1.3-6.43x energy gains versus prior designs at 6% area overhead.

Designing Datacenter Power Delivery Hierarchies for the AI Era

cs.DC · 2026-05-15 · unverdicted · novelty 6.0

Develops a simulation framework showing multi-resource stranding changes deployable capacity and effective costs in AI datacenters, arguing the key metric is deployable capacity over time rather than installed megawatts.

MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces

cs.DC · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

Chakra introduces a standardized graph-based execution trace representation for distributed ML workloads along with supporting tools to enable benchmarking, analysis, generation, and co-design across simulators and hardware.

AME-PIM: Can Memory be Your Next Tensor Accelerator?

cs.AR · 2026-04-30 · unverdicted · novelty 6.0

The paper maps RISC-V AME matrix instructions to HBM-PIM micro-kernels via a PEP-based model and reduction-free outer-product dataflow, achieving up to 14.9 GFLOP/s on Samsung Aquabolt-XL.

DCGen 1.1 Technical Report: Generating Datacenter Configurations (including IT, Power, Cooling)

cs.DC · 2026-03-15 · accept · novelty 6.0

DCGen generates customizable datacenter configurations with IT, power, and cooling components optimized for power, compute, and area targets using real equipment catalogs and workload-specific IT mixes.

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

cs.DC · 2025-11-10 · unverdicted · novelty 6.0

DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing up to 4.5x performance gap versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.

Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration

cs.AR · 2025-04-24 · unverdicted · novelty 6.0

Fine-grained fusion and adaptive scheduling in SSMs deliver up to 4.8x speedup and 10x lower on-chip memory, enabling a fusion-aware accelerator with 1.78x higher performance than MARCA at equal area.

Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM

cs.CR · 2026-05-19 · unverdicted · novelty 5.0

Real-world PIM on UPMEM accelerates cryptographic algorithms when computation is distributed across multiple DRAM ranks, outperforming CPUs at full scale.

The Landscape of GPU-Centric Communication

cs.DC · 2024-09-15 · unverdicted · novelty 2.0

A survey categorizing vendor mechanisms and user-level libraries for GPU-centric communication within and across nodes, with discussion of benefits, challenges, and open questions.

A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks

cs.AR · 2026-05-19

The EDGE Language: Extended General Einsums for Graph Algorithms

cs.DS · 2024-04-17

citing papers explorer

Showing 17 of 17 citing papers.

The Turbo-Charged Mapper: Fast and Optimal Mapping for Energy-efficient and Low-latency Accelerator Design cs.AR · 2026-02-16 · unverdicted · none · ref 14 · 3 links
ICP: Exploiting Instruction Correlation for Prefetching Irregular Memory Accesses cs.AR · 2026-05-15 · unverdicted · none · ref 45
Enhancing Instruction Prefetching via Cache and TLB Management cs.AR · 2026-05-12 · unverdicted · none · ref 1 · 2 links
SPEC CPU: The Next Generation cs.PF · 2026-05-02 · unverdicted · none · ref 174
Mestra: Exploring Migration on Virtualized CGRAs cs.AR · 2026-04-06 · unverdicted · none · ref 25
Fast and Fusiest: An Optimal Fusion-Aware Mapper for Accelerator Design cs.AR · 2026-02-16 · unverdicted · none · ref 24 · 3 links
FEATHER: A Reconfigurable Accelerator with Data Reordering Support for Low-Cost On-Chip Dataflow Switching cs.AR · 2024-05-21 · unverdicted · none · ref 23
Designing Datacenter Power Delivery Hierarchies for the AI Era cs.DC · 2026-05-15 · unverdicted · none · ref 71
MLCommons Chakra: Advancing Performance Benchmarking and Co-design using Standardized Execution Traces cs.DC · 2026-05-11 · unverdicted · none · ref 88 · 2 links
AME-PIM: Can Memory be Your Next Tensor Accelerator? cs.AR · 2026-04-30 · unverdicted · none · ref 15
DCGen 1.1 Technical Report: Generating Datacenter Configurations (including IT, Power, Cooling) cs.DC · 2026-03-15 · accept · none · ref 85
DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication cs.DC · 2025-11-10 · unverdicted · none · ref 26
Fine-Grained Fusion: The Missing Piece in Area-Efficient State Space Model Acceleration cs.AR · 2025-04-24 · unverdicted · none · ref 8
Taking Cryptography Out of the Data Path via Near-Memory Processing in DRAM cs.CR · 2026-05-19 · unverdicted · none · ref 49
The Landscape of GPU-Centric Communication cs.DC · 2024-09-15 · unverdicted · none · ref 62
A complete discussion on fully reconfigurable, digital, scalable, graph and sparsity-aware near-memory accelerator for graph neural networks cs.AR · 2026-05-19 · unreviewed · ref 11
The EDGE Language: Extended General Einsums for Graph Algorithms cs.DS · 2024-04-17 · unreviewed · ref 45
