pith. sign in

arxiv: 1907.02167 · v1 · pith:Z7PS7VSHnew · submitted 2019-07-04 · 💻 cs.AR

To Update or Not To Update?: Bandwidth-Efficient Intelligent Replacement Policies for DRAM Caches

Pith reviewed 2026-05-25 09:17 UTC · model grok-4.3

classification 💻 cs.AR
keywords DRAM cachereplacement policyreuse trackingbandwidth efficiencyRRIPcache bypassstate sampling
0
0 comments X

The pith

Tracking reuse for one line per region makes stateful replacement practical for large DRAM caches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that DRAM caches can use intelligent replacement policies that track reuse state to avoid thrashing, yet avoid the high bandwidth cost of updating that state for every line. It introduces RRIP-AOB to protect high-reuse lines via bypass and aging, then pairs it with ETR so that state from a single sampled line per region guides decisions for the whole region. A sympathetic reader would care because stateless policies prove too coarse for gigascale DRAM caches while full per-line tracking is bandwidth-prohibitive. The result is an 18% speedup on a 2GB DRAM cache using under 1KB of SRAM and 70% less state-update bandwidth than tracking every line.

Core claim

The central claim is that reuse state can be tracked efficiently enough for DRAM caches by sampling only one line per region and using its state to direct replacement and bypass decisions for every line in that region. This enables the RRIP-AOB policy, which tracks high-reuse lines, protects them by bypassing others, and ages their state on bypass, to deliver the hit-rate benefits of stateful policies while keeping bandwidth close to stateless ones.

What carries the argument

Efficient Tracking of Reuse (ETR), which monitors reuse state on one line per region to guide replacement decisions for the remaining lines in the region.

If this is right

  • Stateful replacement policies become bandwidth-viable for DRAM caches instead of being limited to stateless schemes.
  • Common thrashing patterns in gigascale caches are mitigated, raising overall hit rates.
  • Performance improves by 18% on a 2GB DRAM cache while SRAM overhead stays below 1KB.
  • State-tracking bandwidth falls by 70% relative to per-line updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The region-sampling idea could apply to other bandwidth-limited structures such as last-level caches or memory controllers.
  • Workloads with low reuse homogeneity inside regions would likely see smaller gains, suggesting a possible need for adaptive region sizing.
  • Combining ETR with existing hybrid memory or tiered-cache designs could further reduce off-chip traffic in future systems.

Load-bearing premise

That monitoring reuse state for only one line per region supplies sufficiently accurate guidance for replacement decisions across all lines in the region.

What would settle it

A workload in which lines inside the same region show sharply different reuse patterns, when run with ETR, produces hit rates no better than always-install or probabilistic bypass.

Figures

Figures reproduced from arXiv: 1907.02167 by Moinuddin K. Qureshi, Vinson Young.

Figure 1
Figure 1. Figure 1: (a) Always-Install, 90%-Bypass, and Desired replacement policies under mixed high-reuse low-reuse access [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Organization of the DRAM cache used in KNL. DRAM cache is organized at a linesize of 64 bytes, is direct [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Re-Reference Interval Prediction (RRIP). [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Overview of RRIP: Age-On-Bypass (RRIP￾AOB). The transition from one state to another is accom￾plished with replacement-state update operation. Such updates may consume significant bandwidth. 4 [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Speedup from different replacement policies over the baseline always-install direct-mapped DRAM cache. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: MPKI of baseline DRAM cache and RRIP￾AOB. RRIP-AOB reduces misses by 10%. 4.3 Benefits from Reuse-Based Replacement Intelligent replacement policies improve performance by reducing cache misses [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗
Figure 10
Figure 10. Figure 10: ETR’s representative-update and bypass￾decision following enables similar RRIP-AOB install pol￾icy, at reduced update bandwidth (dashed box = benefit). To implement representative-update, we first need to pick a stable representative line. Prior work finds the first access to a region is relatively consistent [31]. If we maintain state for just the first conflicting set in a region, we can maintain good r… view at source ↗
Figure 9
Figure 9. Figure 9: Distribution of RRPV of coresident lines on [PITH_FULL_IMAGE:figures/full_fig_p006_9.png] view at source ↗
Figure 11
Figure 11. Figure 11: Performance of RRIP-AOB, ETR on RRIP-AOB, and an Ideal RRIP-AOB with no state update costs. Coor [PITH_FULL_IMAGE:figures/full_fig_p007_11.png] view at source ↗
Figure 13
Figure 13. Figure 13: Replacement and Install bandwidth consump [PITH_FULL_IMAGE:figures/full_fig_p007_13.png] view at source ↗
Figure 12
Figure 12. Figure 12: Design of Recent-Bypass-Table to enforce [PITH_FULL_IMAGE:figures/full_fig_p007_12.png] view at source ↗
Figure 14
Figure 14. Figure 14: Performance of ETR on RRIP-AOB, ETR on SHiP-AOB, and Ideal SHiP-AOB with no state update costs. [PITH_FULL_IMAGE:figures/full_fig_p008_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Operation and Organization of SHiP. 6.2 Adapting SHiP to Direct-Mapped Cache Conventional SHiP design always installs the incoming line, either with High-Priority or Low-Priority. Unfortunately, with a direct-mapped cache, doing so will degenerate into the Always-Install policy (baseline). We extend SHiP in the con￾text of direct-mapped caches using the option of bypassing with SHiP-AOB. If the resident l… view at source ↗
Figure 16
Figure 16. Figure 16: Bandwidth usage of ETR on RRIP-AOB [left] [PITH_FULL_IMAGE:figures/full_fig_p008_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Performance of set-associative ACCORD, ETR on RRIP-AOB, and ACCORD with ETR on RRIP-AOB. [PITH_FULL_IMAGE:figures/full_fig_p009_17.png] view at source ↗
Figure 19
Figure 19. Figure 19: We first use ACCORD to select which way to [PITH_FULL_IMAGE:figures/full_fig_p009_19.png] view at source ↗
Figure 18
Figure 18. Figure 18: ACCORD enables low-latency associativity [PITH_FULL_IMAGE:figures/full_fig_p009_18.png] view at source ↗
Figure 20
Figure 20. Figure 20: L4 Read-Miss-Per-Kilo-Instruction of Always-Install, ACCORD, RRIP-AOB, and ACCORD + RRIP-AOB. Combination enables 20% miss reduction [PITH_FULL_IMAGE:figures/full_fig_p009_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Speedup of ETR on RRIP-AOB and Ideal RRIP-AOB on multi-programmed workloads. 8.2 Impact on Energy and Power [PITH_FULL_IMAGE:figures/full_fig_p010_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: shows DRAM cache + memory power, energy consumption, and energy-delay-product (EDP) of a system using ETR, normalized to baseline DRAM cache. We model power and energy for stacked DRAM with [38,39], and model power and energy for non-volatile memory with [27]. ETR reduces DRAM cache energy by reducing install and state update bandwidth, and provides lower main memory energy by improving DRAM cache hit-rat… view at source ↗
read the original abstract

This paper investigates intelligent replacement policies for improving the hit-rate of gigascale DRAM caches. Cache replacement policies are commonly used to improve the hit-rate of on-chip caches. The most effective replacement policies often require the cache to track per-line reuse state to inform their decision. A fundamental challenge on DRAM caches, however, is that stateful policies would require significant bandwidth to maintain per-line DRAM cache state. As such, DRAM cache replacement policies have primarily been stateless policies, such as always-install or probabilistic bypass. Unfortunately, we find that stateless policies are often too coarse-grain and become ineffective at the size and associativity of DRAM caches. Ideally, we want a replacement policy that can obtain the hit-rate benefits of stateful replacement policies, but keep the bandwidth-efficiency of stateless policies. In our study, we find that tracking per-line reuse state can enable an effective replacement policy that can mitigate common thrashing patterns seen in gigascale caches. We propose a stateful replacement/bypass policy called RRIP Age-On-Bypass (RRIP-AOB), that tracks reuse state for high-reuse lines, protects such lines by bypassing other lines, and Ages the state On cache Bypass. Unfortunately, such a stateful technique requires significant bandwidth to update state. To this end, we propose Efficient Tracking of Reuse (ETR). ETR makes state tracking efficient by accurately tracking the state of only one line from a region, and using the state of that line to guide the replacement decisions for other lines in that region. ETR reduces the bandwidth for tracking replacement state by 70%, and makes stateful policies practical for DRAM caches. Our evaluations with a 2GB DRAM cache, show that our RRIP-AOB and ETR techniques provide 18% speedup while needing less than 1KB of SRAM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes RRIP-AOB, a stateful replacement/bypass policy that tracks reuse state for high-reuse lines and ages state on bypass to mitigate thrashing in gigascale DRAM caches, combined with ETR, which approximates state tracking by monitoring reuse for only one line per region and applying it to guide decisions for the region. This is claimed to deliver the hit-rate benefits of stateful policies while reducing bandwidth by 70% and SRAM overhead to under 1KB. Evaluations on a 2GB DRAM cache report an 18% speedup over baselines.

Significance. If the results hold under rigorous validation, the work would be significant for DRAM cache design by demonstrating a practical way to deploy intelligent, reuse-aware policies at scale without prohibitive bandwidth or storage costs. The ETR approximation directly targets the core tension between statefulness and efficiency in large caches.

major comments (2)
  1. [ETR description and evaluations] The ETR technique (described after RRIP-AOB): the central 18% speedup and 70% bandwidth claims rest on the assumption that reuse state from a single monitored line per region accurately guides replacement/bypass for all lines in that region. No ablation, error quantification, or sensitivity analysis versus full per-line tracking is supplied to bound the approximation error when intra-region reuse distances are heterogeneous, which is common at gigascale associativity and directly undermines the load-bearing claim that ETR preserves RRIP-AOB benefits.
  2. [Abstract and evaluations] Abstract and evaluation sections: the reported 18% speedup and <1KB SRAM figures are presented without any description of the experimental setup, workload list, baseline policies, simulation parameters, or statistical error analysis, preventing verification that the gains are attributable to RRIP-AOB+ETR rather than workload selection or unstated defaults.
minor comments (1)
  1. [Abstract] Abstract: inconsistent capitalization in 'Ages the state On cache Bypass' should be standardized for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [ETR description and evaluations] The ETR technique (described after RRIP-AOB): the central 18% speedup and 70% bandwidth claims rest on the assumption that reuse state from a single monitored line per region accurately guides replacement/bypass for all lines in that region. No ablation, error quantification, or sensitivity analysis versus full per-line tracking is supplied to bound the approximation error when intra-region reuse distances are heterogeneous, which is common at gigascale associativity and directly undermines the load-bearing claim that ETR preserves RRIP-AOB benefits.

    Authors: We agree that the manuscript would be strengthened by including an ablation study, error quantification, and sensitivity analysis for ETR. The current version emphasizes overall benefits but does not explicitly bound approximation error under heterogeneous intra-region reuse. In revision, we will add these analyses comparing ETR to full per-line tracking to demonstrate that benefits are preserved. revision: yes

  2. Referee: [Abstract and evaluations] Abstract and evaluation sections: the reported 18% speedup and <1KB SRAM figures are presented without any description of the experimental setup, workload list, baseline policies, simulation parameters, or statistical error analysis, preventing verification that the gains are attributable to RRIP-AOB+ETR rather than workload selection or unstated defaults.

    Authors: We agree the abstract and evaluations lack sufficient methodological detail. We will revise to expand the abstract with key setup elements and add explicit descriptions of workloads, baselines, parameters, and statistical error analysis in the evaluations section to enable verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims are empirical simulation outcomes independent of policy definitions

full rationale

The paper defines RRIP-AOB and ETR as new replacement/bypass policies motivated by observed thrashing patterns in gigascale DRAM caches. ETR's core design (tracking reuse state for one line per region and applying it to the region) is presented as an engineering approximation to reduce bandwidth, not derived from equations or prior fitted values. The 18% speedup is reported solely from cycle-accurate simulations on a 2GB DRAM cache configuration; these results do not reduce to the policy definitions by construction, nor rely on self-citations for uniqueness theorems or ansatzes. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific fitted parameters or axioms; region granularity and reuse thresholds in ETR and RRIP-AOB function as implicit design choices whose values are not stated.

pith-pipeline@v0.9.0 · 5869 in / 1059 out tokens · 27121 ms · 2026-05-25T09:17:52.589486+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

  1. [1]

    INTRODUCTION DRAM caches are important for enabling effective hetero- geneous memory systems that can transparently provide the bandwidth of high bandwidth memories [1], and the capacity of high capacity memories [2, 3]. Designs for DRAM cache organize the tag-store such that the tags can be kept in DRAM (to reduce storage overheads) and yet the tags can ...

  2. [2]

    BACKGROUND AND MOTIV ATION We present the organization of our DRAM cache and dis- cuss the storage and bandwidth constraints that make it chal- lenging to apply intelligent replacement policies. 2.1 Organization of a DRAM Cache (KNL) As the tag storage required for gigascale DRAM caches is large, DRAM cache designs often store tags in DRAM and intelligent...

  3. [3]

    We extend USIMM to include a DRAM cache

    METHODOLOGY 3.1 Framework and Configuration We use USIMM [20], an x86 simulator with detailed mem- ory system model. We extend USIMM to include a DRAM cache. Table 1 shows the configuration used in our study. We model a configuration similar to a Intel Knights Landing (KNL) Sub-NUMA Cluster (one-eighth size). We assume a four-level cache hierarchy (L1, L2, L...

  4. [4]

    4.1 RRIP as a Bypassing Policy We design a version of RRIP for limited-associativity caches, called RRIP: Age-On-Bypass (RRIP-AOB)

    RRIP: AGE-ON-BYPASS If we want to use RRIP on direct-mapped DRAM caches, we have to solve two issues: how do we formulate RRIP as a bypassing policy suitable for caches with limited associativity, and how can we mitigate the state update cost of maintaining per-line reuse state in DRAM. 4.1 RRIP as a Bypassing Policy We design a version of RRIP for limite...

  5. [5]

    We can avoid state update costs if we have an effective way to infer an RRPV state

    EFFICIENT TRACKING OF REUSE Demoting state on every cache bypass incurs significant bandwidth overheads–even if we choose to bypass the line, we still have to spend bandwidth to demote the replacement- state. We can avoid state update costs if we have an effective way to infer an RRPV state. Our design reduces the band- width consumed in performing updates...

  6. [6]

    Hit, follow decision Region ID

  7. [7]

    Miss, make new decision

  8. [8]

    De- motions only occur on first miss to a region

    Update RBTPage C0 A Page A1 C Page B0 Figure 12: Design of Recent-Bypass-Table to enforce coordinated-bypass and coordinated-state-update. De- motions only occur on first miss to a region. Operation of ETR: On cache miss, we index into RBT with Region-ID. If there is an RBT miss, we are currently access- ing the representative first-conflicting-set in a regi...

  9. [9]

    SIGNATURE-BASED POLICIES Thus far, we have discussed AOB and ETR only in the context of RRIP. However, AOB and ETR are actually general techniques that enable formulating direct-mapped versions of replacement policies, as well as reducing the bandwidth needed to maintain replacement policy state. AOB and ETR can make even state-of-the-art signature-based ...

  10. [10]

    A recent proposal ACCORD [34] tries to make DRAM caches set-associative, to improve hit rate albeit at an expense of bandwidth and latency [35,36,37]

    TOW ARDS SET-ASSOCIATIVE DESIGNS We evaluate our solutions in the context of a direct-mapped cache, but our designs and insights can be made applicable to set-associative caches. A recent proposal ACCORD [34] tries to make DRAM caches set-associative, to improve hit rate albeit at an expense of bandwidth and latency [35,36,37]. We compare with the recentl...

  11. [11]

    Due to space constraints, we limit these results to ETR implemented on RRIP-AOB

    RESULTS AND DISCUSSION In this section we present sensitivity studies and storage analysis. Due to space constraints, we limit these results to ETR implemented on RRIP-AOB. 8.1 Multi-programmed Workloads To show robustness of our proposal to multi-programmed workloads, we evaluate over a larger set of 20 mix-application workloads. Figure 21 shows that ETR...

  12. [12]

    Probabilistic replacement policies [17, 43], become probabilistic bypass [8] in Figure 5

    RELATED WORK 9.1 Replacement / Bypassing policies Recency-based replacement policies [16, 41, 42] install in- coming lines at highest priority, which degenerate into always- install baseline. Probabilistic replacement policies [17, 43], become probabilistic bypass [8] in Figure 5. Frequency- based replacement [18, 19, 44, 45, 46] orReuse-based replace- me...

  13. [13]

    We would like to use the most effective replacement policies to improve DRAM cache hit-rate

    CONCLUSION This paper investigates improving hit-rate for direct-mapped DRAM caches by utilizing reuse-based replacement polices. We would like to use the most effective replacement policies to improve DRAM cache hit-rate. Unfortunately, state-of-the- art policies based on reuse are designed to compare multiple counter values within the set to decide a re...

  14. [14]

    High bandwidth memory (hbm) dram,

    J. Standard, “High bandwidth memory (hbm) dram,” JESD235, 2013

  15. [15]

    JEDEC, DDR4 SPEC (JESD79-4), 2013

  16. [16]

    A revolutionary breakthrough in memory technology,

    Intel and Micron, “A revolutionary breakthrough in memory technology,” 2015

  17. [17]

    Knights landing: Second-generation intel xeon phi product,

    A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y .-C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016

  18. [18]

    Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,

    M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 235–246, Dec 2012

  19. [19]

    Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,

    C. Chou, A. Jaleel, and M. K. Qureshi, “Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,” in Proceedings of the 42Nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY , USA), pp. 198–210, ACM, 2015

  20. [20]

    Counter-based cache replacement and bypassing algorithms,

    M. Kharbutli and Y . Solihin, “Counter-based cache replacement and bypassing algorithms,” IEEE Trans. Comput., vol. 57, pp. 433–447, Apr. 2008

  21. [21]

    A dueling segmented lru replacement algorithm with adaptive bypassing,

    H. Gao and C. Wilkerson, “A dueling segmented lru replacement algorithm with adaptive bypassing,” in JWAC 2010-1st JILP Worshop on Computer Architecture Competitions: cache replacement Championship, 2010

  22. [22]

    High performance cache replacement using re-reference interval prediction (rrip),

    A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, “High performance cache replacement using re-reference interval prediction (rrip),” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, (New York, NY , USA), pp. 60–71, ACM, 2010

  23. [23]

    Ship: Signature-based hit predictor for high performance caching,

    C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer, “Ship: Signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY , USA), pp. 430–441, ACM, 2011

  24. [24]

    Ship++: Enhancing signature-based hit predictor for improved cache performance,

    V . Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “Ship++: Enhancing signature-based hit predictor for improved cache performance,” in The 2nd Cache Replacement Championship (CRC-2 Workshop in ISCA 2017), 2017

  25. [25]

    Back to the future: Leveraging belady’s algorithm for improved cache replacement,

    A. Jain and C. Lin, “Back to the future: Leveraging belady’s algorithm for improved cache replacement,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 78–89, June 2016

  26. [26]

    Multiperspective reuse prediction,

    D. A. Jiménez and E. Teran, “Multiperspective reuse prediction,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY , USA), pp. 436–448, ACM, 2017

  27. [27]

    Unison cache: A scalable and effective die-stacked dram cache,

    D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A scalable and effective die-stacked dram cache,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 25–37, IEEE, 2014

  28. [28]

    Resilient die-stacked dram caches,

    J. Sim, G. H. Loh, V . Sridharan, and M. O’Connor, “Resilient die-stacked dram caches,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY , USA), pp. 416–427, ACM, 2013

  29. [29]

    Modified lru policies for improving second-level cache behavior,

    W. A. Wong and J.-L. Baer, “Modified lru policies for improving second-level cache behavior,” in High-Performance Computer Architecture, 2000. HPCA-6. Proceedings. Sixth International Symposium on, pp. 49–60, IEEE, 2000

  30. [30]

    Adaptive insertion policies for high performance caching,

    M. K. Qureshi, A. Jaleel, Y . N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion policies for high performance caching,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, (New York, NY , USA), pp. 381–391, ACM, 2007

  31. [31]

    Data cache management using frequency-based replacement,

    J. T. Robinson and M. V . Devarakonda, “Data cache management using frequency-based replacement,” in Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’90, (New York, NY , USA), pp. 134–142, ACM, 1990

  32. [32]

    The v-way cache: demand-based associativity via global replacement,

    M. K. Qureshi, D. Thompson, and Y . N. Patt, “The v-way cache: demand-based associativity via global replacement,” in Computer Architecture, 2005. ISCA’05. Proceedings. 32nd International Symposium on, pp. 544–555, IEEE, 2005

  33. [33]

    Usimm: the utah simulated memory module,

    N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “Usimm: the utah simulated memory module,” University of Utah, Tech. Rep, 2012

  34. [34]

    Knights landing: Second-generation intel xeon phi product,

    A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y . C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016

  35. [35]

    Basic performance measurements of the intel optane DC persistent memory module,

    J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y . J. Soh, Z. Wang, Y . Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic performance measurements of the intel optane DC persistent memory module,” CoRR, vol. abs/1903.05714, 2019

  36. [36]

    Fact sheet: New intel architectures and technologies target expanded market opportunities,

    Intel, “Fact sheet: New intel architectures and technologies target expanded market opportunities,” 2018. Accessed: 2019-03-20

  37. [37]

    Phase change memory: From devices to systems,

    M. K. Qureshi, S. Gurumurthi, and B. Rajendran, “Phase change memory: From devices to systems,” Synthesis Lectures on Computer Architecture, vol. 6, no. 4, pp. 1–134, 2011

  38. [38]

    A 20nm 1.8v 8gb pram with 40mb/s program bandwidth,

    Y . Choi, I. Song, M.-H. Park, H. Chung, S. Chang, B. Cho, J. Kim, Y . Oh, D. Kwon, J. Sunwoo, J. Shin, Y . Rho, C. Lee, M.-G. Kang, J. Lee, Y . Kwon, S. Kim, J. Kim, Y .-J. Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y .-T. Lee, J. Yoo, and G. Jeong, “A 20nm 1.8v 8gb pram with 40mb/s program bandwidth,” in Solid-State Circuits...

  39. [39]

    Phase change memory,

    H. S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, and K. E. Goodson, “Phase change memory,” Proceedings of the IEEE, vol. 98, pp. 2201–2227, Dec 2010

  40. [40]

    Architecting phase change memory as a scalable dram alternative,

    B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change memory as a scalable dram alternative,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, (New York, NY , USA), pp. 2–13, ACM, 2009

  41. [41]

    Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,

    H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,” in Microarchitecture, 2004. MICRO-37 2004. 37th International Symposium on, pp. 81–92, Dec 2004

  42. [42]

    Spec cpu2006 benchmark descriptions,

    J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, pp. 1–17, Sept. 2006

  43. [43]

    The GAP Benchmark Suite

    S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP benchmark suite,” CoRR, vol. abs/1508.03619, 2015

  44. [44]

    Spatial memory streaming,

    S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, “Spatial memory streaming,” in Proceedings of the 33rd Annual International Symposium on Computer Architecture, ISCA ’06, (Washington, DC, USA), pp. 252–263, IEEE Computer Society, 2006

  45. [45]

    Sampling dead block prediction for last-level caches,

    S. M. Khan, Y . Tian, and D. A. Jimenez, “Sampling dead block prediction for last-level caches,” in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, (Washington, DC, USA), pp. 175–186, IEEE Computer Society, 2010

  46. [46]

    Rethinking belady’s algorithm to accommodate prefetching,

    A. Jain and C. Lin, “Rethinking belady’s algorithm to accommodate prefetching,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), June 2018

  47. [47]

    Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,

    V . Young, C. Chou, A. Jaleel, and M. K. Qureshi, “Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 328–339, June 2018

  48. [48]

    Agarwal and S

    A. Agarwal and S. D. Pudar, Column-associative caches: A technique for reducing the miss rate of direct-mapped caches, vol. 21. ACM, 1993

  49. [49]

    Predictive sequential associative cache,

    B. Calder, D. Grunwald, and J. Emer, “Predictive sequential associative cache,” in Proceedings of the 2Nd IEEE Symposium on High-Performance Computer Architecture, HPCA ’96, (Washington, DC, USA), pp. 244–, IEEE Computer Society, 1996

  50. [50]

    Selective cache ways: On-demand cache resource allocation,

    D. H. Albonesi, “Selective cache ways: On-demand cache resource allocation,” in Microarchitecture, 1999. MICRO-32. Proceedings. 32nd Annual International Symposium on, pp. 248–259, IEEE, 1999

  51. [51]

    System and circuit level power modeling of energy-efficient 12 3d-stacked wide i/o drams,

    K. Chandrasekar, C. Weis, B. Akesson, N. Wehn, and K. Goossens, “System and circuit level power modeling of energy-efficient 12 3d-stacked wide i/o drams,” in Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’13, (San Jose, CA, USA), pp. 236–241, EDA Consortium, 2013

  52. [52]

    Rethinking dram power modes for energy proportionality,

    K. T. Malladi, I. Shaeffer, L. Gopalakrishnan, D. Lo, B. C. Lee, and M. Horowitz, “Rethinking dram power modes for energy proportionality,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, (Washington, DC, USA), pp. 131–142, IEEE Computer Society, 2012

  53. [53]

    Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,

    J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,” IEEE Computer Architecture Letters, vol. 11, pp. 61–64, July 2012

  54. [54]

    Insertion and promotion for tree-based pseudolru last-level caches,

    D. A. Jiménez, “Insertion and promotion for tree-based pseudolru last-level caches,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 284–296, ACM, 2013

  55. [55]

    Eelru: simple and effective adaptive page replacement,

    Y . Smaragdakis, S. Kaplan, and P. Wilson, “Eelru: simple and effective adaptive page replacement,” in ACM SIGMETRICS Performance Evaluation Review, vol. 27, pp. 122–133, ACM, 1999

  56. [56]

    Adaptive insertion policies for managing shared caches,

    A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely, Jr., and J. Emer, “Adaptive insertion policies for managing shared caches,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, (New York, NY , USA), pp. 208–219, ACM, 2008

  57. [57]

    A fully associative software-managed cache design,

    E. G. Hallnor and S. K. Reinhardt, “A fully associative software-managed cache design,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA ’00, (New York, NY , USA), pp. 107–116, ACM, 2000

  58. [58]

    The lru-k page replacement algorithm for database disk buffering,

    E. J. O’Neil, P. E. O’Neil, and G. Weikum, “The lru-k page replacement algorithm for database disk buffering,” in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD ’93, (New York, NY , USA), pp. 297–306, ACM, 1993

  59. [59]

    Lrfu: A spectrum of policies that subsumes the least recently used and least frequently used policies,

    D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y . Cho, and C. S. Kim, “Lrfu: A spectrum of policies that subsumes the least recently used and least frequently used policies,” IEEE Trans. Comput., vol. 50, pp. 1352–1361, Dec. 2001

  60. [60]

    Improving cache management policies using dynamic reuse distances,

    N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V . Veidenbaum, “Improving cache management policies using dynamic reuse distances,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, pp. 389–400, IEEE, 2012

  61. [61]

    Cache replacement based on reuse-distance prediction,

    G. Keramidas, P. Petoumenos, and S. Kaxiras, “Cache replacement based on reuse-distance prediction,” in Computer Design, 2007. ICCD

  62. [62]

    245–250, IEEE, 2007

    25th International Conference on, pp. 245–250, IEEE, 2007

  63. [63]

    Candy: Enabling coherent dram caches for multi-node systems,

    C. Chou, A. Jaleel, and M. K. Qureshi, “Candy: Enabling coherent dram caches for multi-node systems,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, Oct 2016

  64. [64]

    Efficiently enabling conventional block sizes for very large die-stacked dram caches,

    G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large die-stacked dram caches,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY , USA), pp. 454–464, ACM, 2011

  65. [65]

    Atcache: reducing dram cache latency via a small sram tag cache,

    C.-C. Huang and V . Nagarajan, “Atcache: reducing dram cache latency via a small sram tag cache,” in Proceedings of the 23rd international conference on Parallel architectures and compilation, pp. 51–60, ACM, 2014

  66. [66]

    Building a low latency, highly associative dram cache with the buffered way predictor,

    Z. Wang, D. A. JimÃl’nez, T. Zhang, G. H. Loh, and Y . Xie, “Building a low latency, highly associative dram cache with the buffered way predictor,” in 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 109–117, Oct 2016

  67. [67]

    Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache,

    D. Jevdjic, S. V olos, and B. Falsafi, “Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY , USA), pp. 404–415, ACM, 2013

  68. [68]

    A fully associative, tagless dram cache,

    Y . Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A fully associative, tagless dram cache,” in Proceedings of the 42Nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY , USA), pp. 211–222, ACM, 2015

  69. [69]

    Efficient footprint caching for tagless dram caches,

    H. Jang, Y . Lee, J. Kim, Y . Kim, J. Kim, J. Jeong, and J. W. Lee, “Efficient footprint caching for tagless dram caches,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 237–248, IEEE, 2016

  70. [70]

    Challenges in heterogeneous die-stacked and off-chip memory systems,

    G. H Loh, N. Jayasena, J. Chung, S. K Reinhardt, M. O’Connor, and K. McGrath, “Challenges in heterogeneous die-stacked and off-chip memory systems,” in 3rd Workshop on SoCs, Heterogeneous Architectures and Workloads (SHAW-3), 02 2012

  71. [71]

    Banshee: Bandwidth-efficient dram caching via software/hardware cooperation,

    X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-efficient dram caching via software/hardware cooperation,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY , USA), pp. 1–14, ACM, 2017. 13