pith. sign in

arxiv: 2605.15645 · v1 · submitted 2026-05-15 · 💻 cs.AR

ICP: Exploiting Instruction Correlation for Prefetching Irregular Memory Accesses

Pith reviewed 2026-05-19 19:43 UTC · model grok-4.3

classification 💻 cs.AR
keywords prefetchingirregular memory accessesinstruction correlationhardware prefetchertemporal prefetchingindirect memory accessdata prefetchingmemory hierarchy
0
0 comments X p. Extension

The pith

ICP prefetching exploits stable instruction correlations to handle irregular memory accesses more efficiently than address-based methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Instruction-Correlation Prefetching as a way to prefetch irregular memory accesses without depending on repeating addresses. It establishes that instructions often recur with consistent data-dependency links even when the addresses they produce do not repeat temporally. By learning these links, the mechanism computes and prefetches future accesses from the results of earlier correlated instructions. A reader would care because the approach delivers measurable speedups on irregular workloads while using orders of magnitude less on-chip storage than prior temporal or indirect prefetchers.

Core claim

ICP observes that although memory addresses may not repeat, the instructions generating them often recur with stable data-dependency relationships. By learning these persistent instruction correlations, ICP speculatively computes and prefetches future irregular accesses using the execution results of their correlated predecessors. Across irregular SPEC CPU and GAP benchmarks, this yields 14 percent higher performance than the temporal prefetcher Triangel and 6 percent higher than the indirect prefetcher DMP while using only 2.1 KB of hardware storage.

What carries the argument

Instruction-Correlation Prefetching (ICP), a hardware mechanism that learns recurring instruction-level correlations with stable data dependencies to speculatively compute and prefetch future irregular memory accesses from predecessor results.

If this is right

  • Performance improves by 14 percent over state-of-the-art temporal prefetchers on irregular SPEC CPU and GAP benchmarks.
  • Storage overhead drops to 2.1 KB, more than three orders of magnitude below typical temporal prefetcher tables.
  • Indirect memory accesses become practical to prefetch without relying on address recurrence.
  • The same correlation-tracking logic could be applied to other irregular patterns that defeat address-history methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hardware designs with tight area budgets for metadata tables may favor instruction-correlation tracking over large address-history buffers.
  • Similar dependency-correlation ideas could extend to other front-end optimizations such as value prediction or load scheduling.
  • Combining ICP with lightweight address-based prefetchers might yield further gains on mixed regular and irregular code.

Load-bearing premise

The assumption that instructions generating irregular addresses recur with stable data-dependency relationships even when the addresses themselves show weak or no temporal recurrence.

What would settle it

Running ICP on a workload whose instructions lack stable correlations, such as one with fully randomized address generation, and observing zero or negative performance change would show the claim does not hold.

Figures

Figures reproduced from arXiv: 2605.15645 by Buqing Xu, Ceyu Xu, Chenlu Miao, Mengming Li, Qijun Zhang, Shang Liu, Wenkai Li, Xiangfeng Sun, Yuan Xie, Zhiyao Xie.

Figure 1
Figure 1. Figure 1: Comparison of the maximum amount of metadata stored by ICP and [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Comparison between instruction-level correlations and memory [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: ICP Overview. instruction correlations in a lightweight table and leverages this information to prefetch irregular memory accesses [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: PC Selector and Classifier. track the performance counters. The Sample Table is indexed by the PC address of memory access instructions. When a PC experiences a prefetch hit or a demand miss, ICP increments the corresponding counter value in the Sample Table. To retain frequently accessed PCs, the Sample Table is managed using an LRU replacement policy. Due to distinct memory access patterns at different c… view at source ↗
Figure 5
Figure 5. Figure 5: Process of dependency tree construction. [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Correlation Table structure and usage. C. PC Correlation Recording ICP employs the Correlation Table to record the instruction correlations identified in Section IV-B [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: IPC speedup on SPEC and GAP benchmarks. ICP outperforms Triangel by 13.99%, Tyche by 10.86%, DMP by 5.97%, VR by 15.03%, and DVR by 3.74%. While ICP attains performance comparable to the combined DMP+Triangel configuration, it provides several key advantages: 1) Substantially smaller storage overhead; 2) Lower hardware complexity; 3) Lower DRAM traffic; 4) Higher energy efficiency. SPEC GAP Geomean 1.0 1.1… view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of DRAM traffic. ICP introduces 6.84% less DRAM traffic than the DMP+Triangel hybrid configuration, 8.02% than the DVR. hybrid configuration inherits the limitations of both prefetch￾ing designs. First, the hybrid design incurs significantly greater hardware overhead than ICP. Specifically, Triangel and other temporal prefetchers typically require substantially larger metadata storage (up to 1MB… view at source ↗
Figure 11
Figure 11. Figure 11: Comparison of identified instruction correlations between ICP and indirect prefetcher. ICP can identify and handle more general correlations. as data dependencies in array-of-pointers structures (e.g., p[i] → ∗p[i]). However, not all such patterns can be effectively handled by DMP, since its prefetching mechanism is explicitly designed for nested array structures. To highlight this limi￾tation, [PITH_FUL… view at source ↗
Figure 13
Figure 13. Figure 13: The dependency path length distribution. ICP completes the majority [PITH_FULL_IMAGE:figures/full_fig_p011_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Performance impact of Correlation Table size. SPEC GAP (a) w/o Data Extractor 1.0 1.2 1.4 1.6 Performance Speedup SPEC GAP (b) w/o Source Predictor SPEC GAP (c) w/o Demand Trigger Baseline Ablated Complete ICP [PITH_FULL_IMAGE:figures/full_fig_p012_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Ablation study of ICP. of no more than three instructions. This indicates that the majority of the speculation process initiated by ICP (i.e., the time from receiving cache responses from PCpre to issuing prefetch requests for PCsuc) requires the execution of fewer than 3 instructions. Additionally, the longest dependency path identified by ICP consists of only 13 instructions. Since ICP handles only basi… view at source ↗
Figure 16
Figure 16. Figure 16: Sensitivity study for temporal prefetcher degree and table size. formance on SPEC, due to Streamline’s efficient metadata compression, while it degrades on GAP. The degradation also stems from the compressed metadata, which binds multiple consecutive metadata targets with a single lookup address. GAP exhibits few memory-address repetition, making gen￾erating multiple prefetches within one lookup less reli… view at source ↗
read the original abstract

Irregular memory accesses pose challenges for effective and efficient data prefetching. While temporal prefetchers have recently shown promise for irregular memory access patterns, their effectiveness fundamentally depends on temporal address recurrence and large metadata storage. When memory addresses exhibit weak or no recurrence, as in indirect memory accesses, temporal prefetchers achieve limited performance gains while incurring substantial storage overhead. This paper proposes Instruction-Correlation Prefetching (ICP), a new hardware prefetching mechanism that exploits instruction-level correlations rather than memory-address correlations to handle irregular memory accesses. ICP observes that although memory addresses may not repeat, the instructions generating them often recur with stable data-dependency relationships. By learning these persistent instruction correlations, ICP speculatively computes and prefetches future irregular accesses using the execution results of their correlated predecessors. Across irregular SPEC CPU and GAP benchmarks, ICP outperforms the state-of-the-art temporal prefetcher Triangel by 14.0% and the indirect prefetcher DMP by 6.0%, while requiring only 2.1 KB of hardware storage, over three orders of magnitude smaller than temporal prefetchers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Instruction-Correlation Prefetching (ICP), a hardware prefetcher for irregular memory accesses that learns stable data-dependency relationships among recurring instructions rather than relying on temporal address recurrence. By speculatively computing future addresses from correlated predecessor instructions, ICP targets cases where address recurrence is weak. On irregular SPEC CPU and GAP benchmarks, it reports 14% speedup over the temporal prefetcher Triangel and 6% over the indirect prefetcher DMP, using only 2.1 KB of storage.

Significance. If the empirical results and underlying stability assumption hold under broader conditions, ICP would represent a meaningful advance by achieving competitive prefetching gains for irregular accesses at storage costs three orders of magnitude below temporal schemes. This could influence future prefetcher designs in processors handling pointer-chasing or graph workloads.

major comments (2)
  1. [§5] §5 (Evaluation): The central performance claims rest on the premise that instruction data-dependency relationships remain stable enough for speculative address computation. However, the reported results use fixed benchmark inputs on SPEC CPU and GAP; no experiments vary inputs, induce phase changes, or employ adaptive data structures to test whether correlations break and accuracy collapses, leaving the generalization of the 14% and 6% gains unsupported.
  2. [§3] §3 (ICP Design): The description of how ICP detects and records instruction correlations lacks a quantitative metric or threshold for 'stable' relationships. Without this, it is unclear how the mechanism distinguishes persistent dependencies from transient ones that would produce incorrect prefetches.
minor comments (2)
  1. [Abstract] Abstract and §1: The storage figure of 2.1 KB is stated without reference to the exact table or configuration that produces it; add a cross-reference.
  2. [Figure 3] Figure 3: Axis labels and legend are too small for readability; enlarge fonts and ensure all prefetcher variants are clearly distinguished.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below with point-by-point responses. We agree that additional validation of correlation stability and explicit design details would strengthen the paper and will incorporate revisions accordingly.

read point-by-point responses
  1. Referee: [§5] §5 (Evaluation): The central performance claims rest on the premise that instruction data-dependency relationships remain stable enough for speculative address computation. However, the reported results use fixed benchmark inputs on SPEC CPU and GAP; no experiments vary inputs, induce phase changes, or employ adaptive data structures to test whether correlations break and accuracy collapses, leaving the generalization of the 14% and 6% gains unsupported.

    Authors: We acknowledge that our current evaluation relies on standard fixed inputs for the SPEC CPU and GAP benchmarks. The instruction correlations targeted by ICP arise from program structure and data dependencies that tend to persist across inputs for the same algorithmic kernels. Nevertheless, we agree that direct testing under input variation and phase changes would better support generalization of the reported speedups. We will add such experiments to the revised manuscript, including runs with alternate inputs and induced phase shifts where feasible, to quantify correlation stability and any resulting impact on prefetch accuracy. revision: yes

  2. Referee: [§3] §3 (ICP Design): The description of how ICP detects and records instruction correlations lacks a quantitative metric or threshold for 'stable' relationships. Without this, it is unclear how the mechanism distinguishes persistent dependencies from transient ones that would produce incorrect prefetches.

    Authors: We appreciate this observation. Section 3 explains that ICP records an instruction correlation only after the pair has been observed repeatedly with matching data-dependency outcomes. This repeated observation serves as the filter against transient relationships. We will revise the manuscript to state the quantitative threshold explicitly (minimum observation count) and describe the consistency check used to discard transient dependencies, thereby clarifying how the design avoids issuing prefetches based on unstable correlations. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical hardware design with no self-referential derivations

full rationale

The paper proposes ICP as a prefetching mechanism that learns persistent instruction correlations to speculatively compute irregular addresses. No equations, fitted parameters, or derivation steps appear in the abstract or description that reduce any claim to its own inputs by construction. Performance results are presented as direct benchmark outcomes (14% over Triangel, 6% over DMP) rather than predictions derived from self-citations or ansatzes. The central premise rests on an empirical observation about instruction recurrence, which is externally falsifiable via benchmarks and does not rely on load-bearing self-citations or uniqueness theorems from prior author work. This is a standard self-contained design paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that instruction correlations remain stable for irregular accesses; no free parameters or invented entities are identifiable from the abstract alone.

axioms (1)
  • domain assumption Instructions generating irregular memory addresses recur with stable data-dependency relationships.
    This premise is invoked to justify learning correlations instead of addresses when temporal recurrence is weak.

pith-pipeline@v0.9.0 · 5747 in / 1165 out tokens · 44354 ms · 2026-05-19T19:43:27.391209+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 1 internal anchor

  1. [1]

    Gate equivalent,

    “Gate equivalent,” https://en.wikipedia.org/wiki/Gate equivalent

  2. [2]

    SPEC CPU 2006,

    “SPEC CPU 2006,” https://www.spec.org/cpu2006/

  3. [3]

    Ghostminion: A strictness-ordered cache system for spectre mitigation,

    S. Ainsworth, “Ghostminion: A strictness-ordered cache system for spectre mitigation,” inMICRO, 2021

  4. [4]

    Graph prefetching using data structure knowledge,

    S. Ainsworth and T. M. Jones, “Graph prefetching using data structure knowledge,” inICS, 2016, pp. 1–11

  5. [5]

    Software prefetching for indirect memory accesses,

    S. Ainsworth and T. M. Jones, “Software prefetching for indirect memory accesses,” inCGO, 2017, pp. 305–317

  6. [6]

    Software prefetching for indirect memory accesses: A microarchitectural perspective,

    S. Ainsworth and T. M. Jones, “Software prefetching for indirect memory accesses: A microarchitectural perspective,”TOCS, pp. 1–34, 2019

  7. [7]

    Triangel: A high-performance, accu- rate, timely on-chip temporal prefetcher,

    S. Ainsworth and L. Mukhanov, “Triangel: A high-performance, accu- rate, timely on-chip temporal prefetcher,” inISCA, 2024

  8. [8]

    Lightweight ml-based runtime prefetcher selection on many-core platforms,

    E. S. Alcorta, M. Madhav, S. Tetrick, N. J. Yadwadkar, and A. Gerst- lauer, “Lightweight ml-based runtime prefetcher selection on many-core platforms,”arXiv preprint arXiv:2307.08635, 2023

  9. [9]

    An effective on-chip preloading scheme to reduce data access penalty,

    J.-L. Baer and T.-F. Chen, “An effective on-chip preloading scheme to reduce data access penalty,” inProceedings of the 1991 ACM/IEEE conference on Supercomputing, 1991, pp. 176–186

  10. [10]

    Domino temporal data prefetcher,

    M. Bakhshalipour, P. Lotfi-Kamran, and H. Sarbazi-Azad, “Domino temporal data prefetcher,” inHPCA, 2018, pp. 131–142

  11. [11]

    Bingo spatial data prefetcher,

    M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi- Azad, “Bingo spatial data prefetcher,” inHPCA, 2019, pp. 399–411

  12. [12]

    The GAP Benchmark Suite

    S. Beamer, K. Asanovi ´c, and D. Patterson, “The gap benchmark suite,” arXiv preprint arXiv:1508.03619, 2015

  13. [13]

    Dspatch: Dual spatial pattern prefetcher,

    R. Bera, A. V . Nori, O. Mutlu, and S. Subramoney, “Dspatch: Dual spatial pattern prefetcher,” inMICRO, 2019, pp. 531–544

  14. [14]

    The gem5 simulator,

    N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashtiet al., “The gem5 simulator,”ACM SIGARCH computer architecture news, 2011

  15. [15]

    Software prefetching,

    D. Callahan, K. Kennedy, and A. Porterfield, “Software prefetching,” ACM SIGARCH Computer Architecture News, vol. 19, no. 2, pp. 40– 52, 1991

  16. [16]

    Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches,

    M. Chaudhuri, J. Gaur, N. Bashyam, S. Subramoney, and J. Nuzman, “Introducing hierarchy-awareness in replacement and bypass algorithms for last-level caches,” inPACT, 2012, pp. 293–304

  17. [17]

    Data access microarchitectures for superscalar processors with compiler-assisted data prefetching,

    W. Y . Chen, S. A. Mahlke, P. P. Chang, and W.-m. W. Hwu, “Data access microarchitectures for superscalar processors with compiler-assisted data prefetching,” inMICRO, 1991, pp. 69–73

  18. [18]

    Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors,

    F. Dahlgren and P. Stenstrom, “Effectiveness of hardware-based stride and sequential prefetching in shared-memory multiprocessors,” in HPCA, 1995, pp. 68–77

  19. [19]

    Differential- matching prefetcher for indirect memory access,

    G. Fu, T. Xia, Z. Luo, R. Chen, W. Zhao, and P. Ren, “Differential- matching prefetcher for indirect memory access,” inHPCA, 2024, pp. 439–453

  20. [20]

    Magellan: A high-performance loop-guided prefetcher for indirect memory access,

    G. Fu, T. Xia, M. Yin, P. J. Nair, M. Lis, and P. Ren, “Magellan: A high-performance loop-guided prefetcher for indirect memory access,” inISCA, 2025, pp. 601–615

  21. [21]

    Micro-armed bandit: Lightweight & reusable reinforcement learning for microarchitecture decision-making,

    G. Gerogiannis and J. Torrellas, “Micro-armed bandit: Lightweight & reusable reinforcement learning for microarchitecture decision-making,” inMICRO, 2023

  22. [22]

    Compiler- directed data prefetching in multiprocessors with memory hierarchies,

    E. H. Gornish, E. D. Granston, and A. V . Veidenbaum, “Compiler- directed data prefetching in multiprocessors with memory hierarchies,” inICS, 1990, pp. 128–142

  23. [23]

    Dsdp: Dual stream data prefetcher,

    M. He, H. Wang, K. Zhou, K. Cui, H. Yan, C. Guo, and R. He, “Dsdp: Dual stream data prefetcher,” inPACT, 2022, pp. 372–383

  24. [24]

    1.1 computing’s energy problem (and what we can do about it),

    M. Horowitz, “1.1 computing’s energy problem (and what we can do about it),” inISSCC, 2014, pp. 10–14

  25. [25]

    Memory prefetching using adaptive stream detec- tion,

    I. Hur and C. Lin, “Memory prefetching using adaptive stream detec- tion,” inMICRO, 2006, pp. 397–408

  26. [26]

    Access map pattern matching for data cache prefetch,

    Y . Ishii, M. Inaba, and K. Hiraki, “Access map pattern matching for data cache prefetch,” inICS, 2009, pp. 499–500

  27. [27]

    Linearizing irregular memory accesses for improved correlated prefetching,

    A. Jain and C. Lin, “Linearizing irregular memory accesses for improved correlated prefetching,” inMICRO, 2013, pp. 247–259

  28. [28]

    Apt-get: Profile-guided timely software prefetching,

    S. Jamilan, T. A. Khan, G. Ayers, B. Kasikci, and H. Litz, “Apt-get: Profile-guided timely software prefetching,” inEurosys, 2022, pp. 747– 764

  29. [29]

    Merging similar patterns for hardware prefetching,

    S. Jiang, Q. Yang, and Y . Ci, “Merging similar patterns for hardware prefetching,” inMICRO, 2022, pp. 1012–1026

  30. [30]

    Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,

    N. P. Jouppi, “Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers,”ACM SIGARCH Computer Architecture News, vol. 18, no. 2SI, pp. 364–373, 1990

  31. [31]

    Dmon: Ef- ficient detection and correction of data locality problems using selective profiling,

    T. A. Khan, I. Neal, G. Pokam, B. Mozafari, and B. Kasikci, “Dmon: Ef- ficient detection and correction of data locality problems using selective profiling,” inOSDI, 2021, pp. 163–181

  32. [32]

    Path confidence based lookahead prefetching,

    J. Kim, S. H. Pugsley, P. V . Gratz, A. N. Reddy, C. Wilkerson, and Z. Chishti, “Path confidence based lookahead prefetching,” inMICRO, 2016, pp. 1–12

  33. [33]

    Stride-directed prefetching for sec- ondary caches,

    S. Kim and A. V . Veidenbaum, “Stride-directed prefetching for sec- ondary caches,” inICPP, 1997, pp. 314–321

  34. [34]

    Division of labor: A more effective approach to prefetching,

    S. Kondguli and M. Huang, “Division of labor: A more effective approach to prefetching,” inISCA, 2018

  35. [35]

    Profile- guided temporal prefetching,

    M. Li, Q. Zhang, Y . Gao, W. Fang, Y . Lu, Y . Ren, and Z. Xie, “Profile- guided temporal prefetching,” inISCA, 2025

  36. [36]

    Integrating prefetcher selection with dynamic request allocation improves prefetching efficiency,

    M. Li, Q. Zhang, Y . Ren, and Z. Xie, “Integrating prefetcher selection with dynamic request allocation improves prefetching efficiency,” in HPCA, 2025

  37. [37]

    Best-offset hardware prefetching,

    P. Michaud, “Best-offset hardware prefetching,” inHPCA, 2016, pp. 469–480

  38. [38]

    Cacti 6.0: A tool to model large caches,

    N. Muralimanohar, R. Balasubramonian, and N. P. Jouppi, “Cacti 6.0: A tool to model large caches,”HP laboratories, vol. 27, p. 28, 2009

  39. [39]

    Vector runahead,

    A. Naithani, S. Ainsworth, T. M. Jones, and L. Eeckhout, “Vector runahead,” inISCA, 2021, pp. 195–208

  40. [40]

    Decoupled vector runahead,

    A. Naithani, J. Roelandts, S. Ainsworth, T. M. Jones, and L. Eeckhout, “Decoupled vector runahead,” inMICRO, 2023, pp. 17–31

  41. [41]

    Berti: an accurate local-delta data prefetcher,

    A. Navarro-Torres, B. Panda, J. Alastruey-Bened ´e, P. Ib´a˜nez, V . Vi˜nals- Y´ufera, and A. Ros, “Berti: an accurate local-delta data prefetcher,” in MICRO, 2022, pp. 975–991

  42. [42]

    Data cache prefetching using a global history buffer,

    K. J. Nesbit and J. E. Smith, “Data cache prefetching using a global history buffer,” inHPCA, 2004, pp. 96–96

  43. [43]

    Bouquet of instruction pointers: Instruction pointer classifier-based spatial hardware prefetching,

    S. Pakalapati and B. Panda, “Bouquet of instruction pointers: Instruction pointer classifier-based spatial hardware prefetching,” inISCA, 2020

  44. [44]

    Sandbox prefetching: Safe run-time evaluation of aggressive prefetchers,

    S. H. Pugsley, Z. Chishti, C. Wilkerson, P.-f. Chuang, R. L. Scott, A. Jaleel, S.-L. Lu, K. Chow, and R. Balasubramonian, “Sandbox prefetching: Safe run-time evaluation of aggressive prefetchers,” in HPCA, 2014, pp. 626–637

  45. [45]

    A cost-effective entangling prefetcher for instructions,

    A. Ros and A. Jimborean, “A cost-effective entangling prefetcher for instructions,” inProceedings of the 48th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2021, pp. 99–111, entangling instruction prefetcher that selects triggering instruction pairs for timely prefetching. [Online]. Available: https: //doi.org/10.1109/ISCA52012.2021.00099

  46. [46]

    Automatically characterizing large scale program behavior,

    T. Sherwood, E. Perelman, G. Hamerly, and B. Calder, “Automatically characterizing large scale program behavior,”ACM SIGPLAN Notices, pp. 45–57, 2002

  47. [47]

    Efficiently prefetching complex address pat- terns,

    M. Shevgoor, S. Koladiya, R. Balasubramonian, C. Wilkerson, S. H. Pugsley, and Z. Chishti, “Efficiently prefetching complex address pat- terns,” inMICRO, 2015, pp. 141–152

  48. [48]

    Spatial memory streaming,

    S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, “Spatial memory streaming,”ACM SIGARCH Computer Architecture News, pp. 252–263, 2006

  49. [49]

    Criticality-based optimizations for efficient load processing,

    S. Subramaniam, A. Bracy, H. Wang, and G. H. Loh, “Criticality-based optimizations for efficient load processing,” inHPCA, 2009

  50. [50]

    Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware- software co-design,

    N. Talati, K. May, A. Behroozi, Y . Yang, K. Kaszyk, C. Vasiladiotis, T. Verma, L. Li, B. Nguyen, J. Sunet al., “Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware- software co-design,” inHPCA, 2021, pp. 654–667

  51. [51]

    Practical off-chip meta-data for temporal memory streaming,

    T. F. Wenisch, M. Ferdman, A. Ailamaki, B. Falsafi, and A. Moshovos, “Practical off-chip meta-data for temporal memory streaming,” inHPCA, 2009, pp. 79–90

  52. [52]

    Ship: Signature- based hit predictor for high performance caching,

    C. J. Wu, G. H. Loh, M. D. Smith, and D. M. Brooks, “Ship: Signature- based hit predictor for high performance caching,” inMICRO, 2011

  53. [53]

    Practical temporal prefetching with compressed on-chip metadata,

    H. Wu, K. Nathella, M. Pabst, D. Sunwoo, A. Jain, and C. Lin, “Practical temporal prefetching with compressed on-chip metadata,” IEEE Transactions on Computers, pp. 2858–2871, 2021

  54. [54]

    Temporal prefetching without the off-chip metadata,

    H. Wu, K. Nathella, J. Pusdesris, D. Sunwoo, A. Jain, and C. Lin, “Temporal prefetching without the off-chip metadata,” inMICRO, 2019, pp. 996–1008

  55. [55]

    Efficient metadata management for irregular data prefetching,

    H. Wu, K. Nathella, D. Sunwoo, A. Jain, and C. Lin, “Efficient metadata management for irregular data prefetching,” inISCA, 2019, pp. 449–461. 14

  56. [56]

    Hitting the memory wall: Implications of the obvious,

    W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,”ACM SIGARCH computer architecture news, vol. 23, no. 1, pp. 20–24, 1995

  57. [57]

    Tyche: An efficient and general prefetcher for indirect memory accesses,

    F. Xue, C. Han, X. Li, J. Wu, T. Zhang, T. Liu, Y . Hao, Z. Du, Q. Guo, and F. Zhang, “Tyche: An efficient and general prefetcher for indirect memory accesses,”ACM Transactions on Architecture and Code Optimization, pp. 1–26, 2024

  58. [58]

    Imp: Indirect memory prefetcher,

    X. Yu, C. J. Hughes, N. Satish, and S. Devadas, “Imp: Indirect memory prefetcher,” inMICRO, 2015, pp. 178–190

  59. [59]

    Resemble: reinforced ensemble framework for data prefetching,

    P. Zhang, R. Kannan, A. Srivastava, A. V . Nori, and V . K. Prasanna, “Resemble: reinforced ensemble framework for data prefetching,” inSC, 2022

  60. [60]

    Rpg2: Robust profile-guided runtime prefetch generation,

    Y . Zhang, N. Sobotka, S. Park, S. Jamilan, T. A. Khan, B. Kasikci, G. A. Pokam, H. Litz, and J. Devietti, “Rpg2: Robust profile-guided runtime prefetch generation,” inASPLOS, 2024, pp. 999–1013. 15