Hermes: Accelerating long-latency load requests via perceptron-based off-chip load prediction

Yannan Nellie Wu, Po-An Tsai, Angshuman Parashar, Vivienne Sze, Joel S · 2022 · arXiv 6248.2022

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 4

citation-polarity summary

background 4

representative citing papers

Enhancing Instruction Prefetching via Cache and TLB Management

cs.AR · 2026-05-12 · unverdicted · novelty 7.0 · 3 refs

IP-CaT jointly optimizes TLB and cache management for L1I prefetching via a translation prefetch buffer and trimodal replacement policy, yielding 8.7% geomean speedup over EPI across 105 server workloads.

NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing

cs.AR · 2026-05-21 · conditional · novelty 6.0

NasZip delivers up to 8.4x speedup over CPU baselines and 1.69x over prior NDP accelerators for ANNS by combining near-data processing with statistics-based PCA early exiting, dynamic-float encoding, and data-aware neighbor mapping.

Proxics: an efficient programming model for far memory accelerators

cs.OS · 2026-04-20 · conditional · novelty 6.0

Proxics introduces lightweight virtual processors and low-latency communication channels as portable OS abstractions for programming near-data processing accelerators, demonstrated on real hardware for memory-intensive workloads.

PG-MDP: Profile-Guided Memory Dependence Prediction for Area-Constrained Cores

cs.PL · 2026-04-09 · unverdicted · novelty 6.0

Profile-guided opcode labeling removes consistently independent loads from the MDP working set, cutting queries 79%, false dependencies 77%, and raising small-core IPC 1.47% on SPEC2017 intspeed.

Learning-Optimized Qubit Mapping and Reuse to Minimize Inter-Core Communication in Modular Quantum Architectures

quant-ph · 2025-06-11 · unverdicted · novelty 6.0

QARMA applies transformer-augmented reinforcement learning to qubit allocation and reuse in modular quantum systems, reporting up to 86% average reduction in inter-core communications versus optimized Qiskit baselines.

The EDGE Language: Extended General Einsums for Graph Algorithms

cs.DS · 2024-04-17 · unverdicted · novelty 6.0

EDGE extends Einsum notation with graph-specific operations to create a unified tensor-algebra framework for expressing and manipulating graph algorithms.

Managing Classical Processing Requirements for Quantum Error Correction

quant-ph · 2024-06-26 · unverdicted · novelty 5.0

A two-level decoder scheduling framework reduces classical processing requirements for quantum error correction by 10-40% on fault-tolerant benchmarks by managing bursty workloads as shared resources.

citing papers explorer

Enhancing Instruction Prefetching via Cache and TLB Management
cs.AR · 2026-05-12 · unverdicted · none · ref 18 · 3 links

NasZip: Software and Hardware Co-Design to Accelerate Approximate Nearest Neighbor Search with DIMM-Based Near-Data Processing
cs.AR · 2026-05-21 · conditional · none · ref 51

Proxics: an efficient programming model for far memory accelerators
cs.OS · 2026-04-20 · conditional · none · ref 33

PG-MDP: Profile-Guided Memory Dependence Prediction for Area-Constrained Cores
cs.PL · 2026-04-09 · unverdicted · none · ref 26

Learning-Optimized Qubit Mapping and Reuse to Minimize Inter-Core Communication in Modular Quantum Architectures
quant-ph · 2025-06-11 · unverdicted · none · ref 32

The EDGE Language: Extended General Einsums for Graph Algorithms
cs.DS · 2024-04-17 · unverdicted · none · ref 89

Managing Classical Processing Requirements for Quantum Error Correction
quant-ph · 2024-06-26 · unverdicted · none · ref 56
