To Update or Not To Update?: Bandwidth-Efficient Intelligent Replacement Policies for DRAM Caches

Moinuddin K. Qureshi; Vinson Young

REVIEW · 2 major objections · 1 minor · 71 references

Tracking reuse for one line per region makes stateful replacement practical for large DRAM caches.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-25 09:17 UTC pith:Z7PS7VSH

Load-bearing objection: RRIP-AOB plus ETR brings stateful replacement to DRAM caches with low bandwidth cost, but the single-line-per-region approximation in ETR is the part that still needs evidence.

arXiv:1907.02167v1 · pith:Z7PS7VSH · submitted 2019-07-04 · cs.AR

classification cs.AR

keywords: DRAM cache, replacement policy, reuse tracking, bandwidth efficiency, RRIP, cache bypass, state sampling

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that DRAM caches can use intelligent replacement policies that track reuse state to avoid thrashing, yet avoid the high bandwidth cost of updating that state for every line. It introduces RRIP-AOB to protect high-reuse lines via bypass and aging, then pairs it with ETR so that state from a single sampled line per region guides decisions for the whole region. A sympathetic reader would care because stateless policies prove too coarse for gigascale DRAM caches while full per-line tracking is bandwidth-prohibitive. The result is an 18% speedup on a 2GB DRAM cache using under 1KB of SRAM and 70% less state-update bandwidth than tracking every line.

Core claim

The central claim is that reuse state can be tracked efficiently enough for DRAM caches by sampling only one line per region and using its state to direct replacement and bypass decisions for every line in that region. This enables the RRIP-AOB policy, which tracks high-reuse lines, protects them by bypassing others, and ages their state on bypass, to deliver the hit-rate benefits of stateful policies while keeping bandwidth close to stateless ones.
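To make the bypass-and-age mechanism concrete, here is a minimal sketch of an RRIP-AOB-style decision for a direct-mapped cache. It is an illustration under stated assumptions, not the paper's implementation: the 2-bit RRPV width, the install value, promotion to zero on a hit, and all names (`RripAobCache`, `Line`, `state_updates`) are hypothetical choices. The abstract only specifies that high-reuse lines are protected by bypassing other lines and that state is aged on bypass.

```python
from dataclasses import dataclass, field

RRPV_MAX = 3  # assumption: 2-bit re-reference prediction value (0 = reuse soon)


@dataclass
class Line:
    tag: int
    rrpv: int


@dataclass
class RripAobCache:
    """Direct-mapped cache with a bypass-or-install decision on every miss."""
    num_sets: int
    sets: dict = field(default_factory=dict)
    hits: int = 0
    installs: int = 0
    bypasses: int = 0
    state_updates: int = 0  # writes of replacement state to the DRAM cache

    def access(self, addr: int) -> str:
        index, tag = addr % self.num_sets, addr // self.num_sets
        line = self.sets.get(index)
        if line is not None and line.tag == tag:
            self.hits += 1
            if line.rrpv != 0:          # promote on reuse (assumed policy)
                line.rrpv = 0
                self.state_updates += 1
            return "hit"
        if line is None or line.rrpv >= RRPV_MAX:
            # Empty slot, or resident line predicted to have distant reuse.
            self.sets[index] = Line(tag, RRPV_MAX - 1)
            self.installs += 1
            return "install"
        # Protect the resident line: bypass the incoming one and age the state.
        line.rrpv += 1
        self.bypasses += 1
        self.state_updates += 1
        return "bypass"


cache = RripAobCache(num_sets=4)
outcomes = [cache.access(a) for a in [0, 0, 4, 8, 12, 16]]
print(outcomes, cache.state_updates)
```

In this trace, line 0 is reused once and then survives three conflicting misses before it is finally replaced. The `state_updates` counter illustrates the cost side: every bypass writes state back, which is the traffic ETR is meant to reduce.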

What carries the argument

Efficient Tracking of Reuse (ETR), which monitors reuse state on one line per region to guide replacement decisions for the remaining lines in the region.
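The paper excerpt quoted in the reference graph describes ETR as indexing a Recent-Bypass-Table (RBT) with a Region-ID on each miss: an RBT miss means the access is to the representative set and a new decision is made, while an RBT hit means the recorded decision is followed. The sketch below illustrates that flow only. The 4 KB region size, the table capacity, the LRU eviction, the RRPV threshold, and the names `RecentBypassTable` and `etr_on_miss` are all assumptions made for illustration.

```python
from collections import OrderedDict

REGION_BYTES = 4096  # assumption: one region is one 4 KB page


class RecentBypassTable:
    """Small LRU table: region ID -> most recent decision for that region."""

    def __init__(self, capacity: int = 64):  # capacity is a hypothetical value
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, region_id):
        decision = self.entries.get(region_id)
        if decision is not None:
            self.entries.move_to_end(region_id)  # refresh LRU position
        return decision

    def record(self, region_id, decision):
        self.entries[region_id] = decision
        self.entries.move_to_end(region_id)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used


def etr_on_miss(rbt, addr, resident_rrpv, rrpv_max=3):
    """Return (decision, state_update_needed) for a DRAM-cache miss.

    resident_rrpv is the reuse state of the line currently occupying the
    target set; pass rrpv_max for an empty set.
    """
    region_id = addr // REGION_BYTES
    decision = rbt.lookup(region_id)
    if decision is not None:
        return decision, False  # RBT hit: follow the decision, no state write
    # RBT miss: treat this set as the region's representative and decide anew.
    decision = "install" if resident_rrpv >= rrpv_max else "bypass"
    rbt.record(region_id, decision)
    # Assumption: only a bypass costs a separate state write (the demotion);
    # an install writes state together with the line itself.
    return decision, decision == "bypass"


rbt = RecentBypassTable(capacity=2)
print(etr_on_miss(rbt, 0x0000, resident_rrpv=0))  # first miss in region 0
print(etr_on_miss(rbt, 0x0040, resident_rrpv=3))  # same region: follows decision
```

The second call returns the region's recorded decision even though that set's own state would have allowed an install. That is exactly the approximation the load-bearing premise below is about.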

Load-bearing premise

That monitoring reuse state for only one line per region supplies sufficiently accurate guidance for replacement decisions across all lines in the region.

What would settle it

A workload whose lines within the same region show sharply different reuse patterns and on which ETR achieves hit rates no better than always-install or probabilistic bypass.

If this is right

Stateful replacement policies become bandwidth-viable for DRAM caches instead of being limited to stateless schemes.
Common thrashing patterns in gigascale caches are mitigated, raising overall hit rates.
Performance improves by 18% on a 2GB DRAM cache while SRAM overhead stays below 1KB.
State-tracking bandwidth falls by 70% relative to per-line updates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The region-sampling idea could apply to other bandwidth-limited structures such as last-level caches or memory controllers.
Workloads with low reuse homogeneity inside regions would likely see smaller gains, suggesting a possible need for adaptive region sizing.
Combining ETR with existing hybrid memory or tiered-cache designs could further reduce off-chip traffic in future systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

RRIP-AOB plus ETR brings stateful replacement to DRAM caches with low bandwidth cost, but the single-line-per-region approximation in ETR is the part that still needs evidence.

The core contribution is adapting RRIP into RRIP-AOB so that high-reuse lines get protected via bypass, then using ETR to track reuse state for only one line per region and apply it to the rest. This directly targets the bandwidth problem that has kept DRAM caches on stateless policies. The paper shows why always-install and probabilistic bypass fall short at the size and associativity of gigascale DRAM caches, and gives concrete overhead numbers: 70% less state-update bandwidth, under 1KB of SRAM, and an 18% speedup on a 2GB cache. Those targets are the right ones for the setting.

Referee Report

2 major / 1 minor

Summary. The paper proposes RRIP-AOB, a stateful replacement/bypass policy that tracks reuse state for high-reuse lines and ages state on bypass to mitigate thrashing in gigascale DRAM caches, combined with ETR, which approximates state tracking by monitoring reuse for only one line per region and applying it to guide decisions for the region. This is claimed to deliver the hit-rate benefits of stateful policies while reducing bandwidth by 70% and SRAM overhead to under 1KB. Evaluations on a 2GB DRAM cache report an 18% speedup over baselines.

Significance. If the results hold under rigorous validation, the work would be significant for DRAM cache design by demonstrating a practical way to deploy intelligent, reuse-aware policies at scale without prohibitive bandwidth or storage costs. The ETR approximation directly targets the core tension between statefulness and efficiency in large caches.

major comments (2)

[ETR description and evaluations] The ETR technique (described after RRIP-AOB): the central 18% speedup and 70% bandwidth claims rest on the assumption that reuse state from a single monitored line per region accurately guides replacement/bypass for all lines in that region. No ablation, error quantification, or sensitivity analysis versus full per-line tracking is supplied to bound the approximation error when intra-region reuse distances are heterogeneous, which is common at gigascale associativity and directly undermines the load-bearing claim that ETR preserves RRIP-AOB benefits.
[Abstract and evaluations] Abstract and evaluation sections: the reported 18% speedup and <1KB SRAM figures are presented without any description of the experimental setup, workload list, baseline policies, simulation parameters, or statistical error analysis, preventing verification that the gains are attributable to RRIP-AOB+ETR rather than workload selection or unstated defaults.
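One way the error quantification requested in the first major comment could be framed is as an agreement rate: for each region, compare the decision implied by the representative line's state with the decision each other line's own state would imply. The sketch below is a hypothetical metric, not something reported in the paper; the function names, the RRPV threshold, and the choice of index 0 as the representative are assumptions.

```python
def decide(rrpv, rrpv_max=3):
    """Decision implied by a resident line's state (assumed threshold rule)."""
    return "install" if rrpv >= rrpv_max else "bypass"


def region_agreement(regions, representative=0, rrpv_max=3):
    """Fraction of non-representative lines whose own state implies the same
    decision as the representative line of their region.

    regions: dict mapping region ID -> list of per-line RRPVs.
    """
    agree = total = 0
    for rrpvs in regions.values():
        rep_decision = decide(rrpvs[representative], rrpv_max)
        for i, rrpv in enumerate(rrpvs):
            if i == representative:
                continue  # the representative trivially agrees with itself
            agree += decide(rrpv, rrpv_max) == rep_decision
            total += 1
    return agree / total if total else 1.0


# Hypothetical snapshots: one homogeneous region, one heterogeneous region.
snapshot = {"homogeneous": [0, 0, 1, 0], "heterogeneous": [0, 3, 3, 0]}
print(region_agreement(snapshot))
```

An agreement rate close to 1.0 across workloads would support the single-line approximation; a low rate on workloads with heterogeneous intra-region reuse would bound how much benefit ETR can give up relative to per-line tracking.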

minor comments (1)

[Abstract] Abstract: inconsistent capitalization in 'Ages the state On cache Bypass' should be standardized for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will incorporate revisions to strengthen the manuscript.

Referee: [ETR description and evaluations] The ETR technique (described after RRIP-AOB): the central 18% speedup and 70% bandwidth claims rest on the assumption that reuse state from a single monitored line per region accurately guides replacement/bypass for all lines in that region. No ablation, error quantification, or sensitivity analysis versus full per-line tracking is supplied to bound the approximation error when intra-region reuse distances are heterogeneous, which is common at gigascale associativity and directly undermines the load-bearing claim that ETR preserves RRIP-AOB benefits.

Authors: We agree that the manuscript would be strengthened by including an ablation study, error quantification, and sensitivity analysis for ETR. The current version emphasizes overall benefits but does not explicitly bound approximation error under heterogeneous intra-region reuse. In revision, we will add these analyses comparing ETR to full per-line tracking to demonstrate that benefits are preserved. revision: yes
Referee: [Abstract and evaluations] Abstract and evaluation sections: the reported 18% speedup and <1KB SRAM figures are presented without any description of the experimental setup, workload list, baseline policies, simulation parameters, or statistical error analysis, preventing verification that the gains are attributable to RRIP-AOB+ETR rather than workload selection or unstated defaults.

Authors: We agree the abstract and evaluations lack sufficient methodological detail. We will revise to expand the abstract with key setup elements and add explicit descriptions of workloads, baselines, parameters, and statistical error analysis in the evaluations section to enable verification. revision: yes

Circularity Check

0 steps flagged

No significant circularity; performance claims are empirical simulation outcomes independent of policy definitions

The paper defines RRIP-AOB and ETR as new replacement/bypass policies motivated by observed thrashing patterns in gigascale DRAM caches. ETR's core design (tracking reuse state for one line per region and applying it to the region) is presented as an engineering approximation to reduce bandwidth, not derived from equations or prior fitted values. The 18% speedup is reported solely from cycle-accurate simulations on a 2GB DRAM cache configuration; these results do not reduce to the policy definitions by construction, nor rely on self-citations for uniqueness theorems or ansatzes. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review prevents identification of specific fitted parameters or axioms; region granularity and reuse thresholds in ETR and RRIP-AOB function as implicit design choices whose values are not stated.

pith-pipeline@v0.9.0 · 5869 in / 1059 out tokens · 27121 ms · 2026-05-25T09:17:52.589486+00:00 · methodology

Original abstract

This paper investigates intelligent replacement policies for improving the hit-rate of gigascale DRAM caches. Cache replacement policies are commonly used to improve the hit-rate of on-chip caches. The most effective replacement policies often require the cache to track per-line reuse state to inform their decision. A fundamental challenge on DRAM caches, however, is that stateful policies would require significant bandwidth to maintain per-line DRAM cache state. As such, DRAM cache replacement policies have primarily been stateless policies, such as always-install or probabilistic bypass. Unfortunately, we find that stateless policies are often too coarse-grain and become ineffective at the size and associativity of DRAM caches. Ideally, we want a replacement policy that can obtain the hit-rate benefits of stateful replacement policies, but keep the bandwidth-efficiency of stateless policies. In our study, we find that tracking per-line reuse state can enable an effective replacement policy that can mitigate common thrashing patterns seen in gigascale caches. We propose a stateful replacement/bypass policy called RRIP Age-On-Bypass (RRIP-AOB), that tracks reuse state for high-reuse lines, protects such lines by bypassing other lines, and Ages the state On cache Bypass. Unfortunately, such a stateful technique requires significant bandwidth to update state. To this end, we propose Efficient Tracking of Reuse (ETR). ETR makes state tracking efficient by accurately tracking the state of only one line from a region, and using the state of that line to guide the replacement decisions for other lines in that region. ETR reduces the bandwidth for tracking replacement state by 70%, and makes stateful policies practical for DRAM caches. Our evaluations with a 2GB DRAM cache, show that our RRIP-AOB and ETR techniques provide 18% speedup while needing less than 1KB of SRAM.
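The two stateless baselines named in the abstract can be sketched in a few lines, which helps show why they are coarse. This is a toy model under assumptions, not the paper's simulator: the function name `simulate`, the `bypass_prob` parameter, and the synthetic trace are hypothetical, and the cache is a plain direct-mapped array with no bandwidth model.

```python
import random


def simulate(trace, num_sets, bypass_prob=0.0, seed=0):
    """Hit rate of a direct-mapped cache under a stateless install policy.

    bypass_prob = 0.0 models always-install; larger values model
    probabilistic bypass. Empty sets are always filled.
    """
    rng = random.Random(seed)
    cache = {}  # set index -> resident tag
    hits = 0
    for addr in trace:
        index, tag = addr % num_sets, addr // num_sets
        if cache.get(index) == tag:
            hits += 1
        elif index not in cache or rng.random() >= bypass_prob:
            cache[index] = tag  # install, evicting whatever was resident
    return hits / len(trace)


# One hot line (address 0) interleaved with streaming lines that conflict
# with it in a 4-set cache.
trace = [0, 4, 0, 8, 0, 12, 0, 16]
print(simulate(trace, num_sets=4, bypass_prob=0.0))  # always-install
print(simulate(trace, num_sets=4, bypass_prob=1.0))  # always bypass if occupied
```

On this trace always-install never hits, because each streaming line evicts the hot one. A fixed bypass probability helps here, but it applies the same coin flip to every line regardless of reuse, which is the coarseness the abstract refers to.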

Figures

Figures reproduced from arXiv: 1907.02167 by Moinuddin K. Qureshi, Vinson Young.

**Figure 2.** Organization of the DRAM cache used in KNL. DRAM cache is organized at a linesize of 64 bytes, is direct …

**Figure 3.** Re-Reference Interval Prediction (RRIP).
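For readers unfamiliar with RRIP, the following is a minimal sketch of static RRIP on one set of a set-associative cache, based on the commonly described scheme: insert with a long predicted re-reference interval, promote to zero on a hit, and evict a line whose RRPV has reached the maximum, aging the whole set when none has. The class name `SrripSet` and its parameters are illustrative, and the figure itself may show details this sketch omits.

```python
class SrripSet:
    """One cache set managed with static RRIP (hit-priority promotion)."""

    def __init__(self, ways=4, rrpv_bits=2):
        self.ways = ways
        self.max_rrpv = (1 << rrpv_bits) - 1
        self.lines = []  # each entry is [tag, rrpv]

    def access(self, tag):
        """Return True on a hit, False on a miss (the line is then installed)."""
        for line in self.lines:
            if line[0] == tag:
                line[1] = 0  # predict near-immediate re-reference
                return True
        if len(self.lines) < self.ways:
            self.lines.append([tag, self.max_rrpv - 1])
            return False
        while True:
            for line in self.lines:
                if line[1] == self.max_rrpv:
                    # Victim found: replace it, inserting with a long interval.
                    line[0], line[1] = tag, self.max_rrpv - 1
                    return False
            for line in self.lines:
                line[1] += 1  # no victim yet: age every line and retry


s = SrripSet(ways=2)
print([s.access(t) for t in ["A", "A", "B", "C", "A", "B"]])
print(sorted(line[0] for line in s.lines))
```

In this trace the reused line A stays resident while B and C displace each other. Note that victim selection compares RRPVs across the ways of a set, which is why a direct-mapped cache needs the bypass formulation that RRIP-AOB provides.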

**Figure 4.** Overview of RRIP: Age-On-Bypass (RRIP-AOB). The transition from one state to another is accomplished with a replacement-state update operation. Such updates may consume significant bandwidth.

**Figure 5.** Speedup from different replacement policies over the baseline always-install direct-mapped DRAM cache.

**Figure 6.** MPKI of baseline DRAM cache and RRIP-AOB. RRIP-AOB reduces misses by 10%.

**Figure 10.** ETR’s representative-update and bypass-decision following enable a similar RRIP-AOB install policy at reduced update bandwidth (dashed box = benefit). To implement representative-update, we first need to pick a stable representative line. Prior work finds the first access to a region is relatively consistent [31]. If we maintain state for just the first conflicting set in a region, we can maintain good r…

**Figure 9.** Distribution of RRPV of coresident lines on …

**Figure 11.** Performance of RRIP-AOB, ETR on RRIP-AOB, and an Ideal RRIP-AOB with no state update costs. Coor…

**Figure 13.** Replacement and Install bandwidth consump…

**Figure 12.** Design of Recent-Bypass-Table to enforce coordinated-bypass and coordinated-state-update. Demotions only occur on first miss to a region.

**Figure 14.** Performance of ETR on RRIP-AOB, ETR on SHiP-AOB, and Ideal SHiP-AOB with no state update costs.

**Figure 15.** Operation and Organization of SHiP. From Section 6.2 (Adapting SHiP to Direct-Mapped Cache): Conventional SHiP design always installs the incoming line, either with High-Priority or Low-Priority. Unfortunately, with a direct-mapped cache, doing so will degenerate into the Always-Install policy (baseline). We extend SHiP in the context of direct-mapped caches using the option of bypassing with SHiP-AOB. If the resident l…

**Figure 16.** Bandwidth usage of ETR on RRIP-AOB [left] …

**Figure 17.** Performance of set-associative ACCORD, ETR on RRIP-AOB, and ACCORD with ETR on RRIP-AOB.

**Figure 19.** We first use ACCORD to select which way to …

**Figure 18.** ACCORD enables low-latency associativity …

**Figure 20.** L4 Read-Miss-Per-Kilo-Instruction of Always-Install, ACCORD, RRIP-AOB, and ACCORD + RRIP-AOB. Combination enables 20% miss reduction.

**Figure 21.** Speedup of ETR on RRIP-AOB and Ideal RRIP-AOB on multi-programmed workloads.

**Figure 22.** DRAM cache + memory power, energy consumption, and energy-delay-product (EDP) of a system using ETR, normalized to baseline DRAM cache. We model power and energy for stacked DRAM with [38,39], and model power and energy for non-volatile memory with [27]. ETR reduces DRAM cache energy by reducing install and state update bandwidth, and provides lower main memory energy by improving DRAM cache hit-rat…
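The normalized quantities in the Figure 22 caption are related by simple arithmetic, sketched below. The relations follow from the standard definitions (power is energy over time, EDP is energy times delay); the function names and the example energy ratio are hypothetical inputs, not values taken from the paper.

```python
def normalized_delay(speedup):
    """Execution time relative to baseline, given speedup = T_base / T_new."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 / speedup


def normalized_power(energy_ratio, speedup):
    """Average power relative to baseline: (E/E0) / (T/T0)."""
    return energy_ratio / normalized_delay(speedup)


def normalized_edp(energy_ratio, speedup):
    """Energy-delay product relative to baseline: (E/E0) * (T/T0)."""
    return energy_ratio * normalized_delay(speedup)


# With the reported 18% speedup and, hypothetically, unchanged energy,
# EDP alone would already drop to roughly 0.85 of baseline.
print(normalized_edp(energy_ratio=1.0, speedup=1.18))
```

A design that cuts energy and runs faster improves EDP on both factors, while its average power can rise or fall depending on which effect dominates.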

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 2 internal anchors

[1]

INTRODUCTION DRAM caches are important for enabling effective heterogeneous memory systems that can transparently provide the bandwidth of high bandwidth memories [1], and the capacity of high capacity memories [2, 3]. Designs for DRAM cache organize the tag-store such that the tags can be kept in DRAM (to reduce storage overheads) and yet the tags can ...
[2]

BACKGROUND AND MOTIVATION We present the organization of our DRAM cache and discuss the storage and bandwidth constraints that make it challenging to apply intelligent replacement policies. 2.1 Organization of a DRAM Cache (KNL) As the tag storage required for gigascale DRAM caches is large, DRAM cache designs often store tags in DRAM and intelligent...
[3]

We extend USIMM to include a DRAM cache

METHODOLOGY 3.1 Framework and Configuration We use USIMM [20], an x86 simulator with detailed memory system model. We extend USIMM to include a DRAM cache. Table 1 shows the configuration used in our study. We model a configuration similar to an Intel Knights Landing (KNL) Sub-NUMA Cluster (one-eighth size). We assume a four-level cache hierarchy (L1, L2, L...
[4]

4.1 RRIP as a Bypassing Policy We design a version of RRIP for limited-associativity caches, called RRIP: Age-On-Bypass (RRIP-AOB)

RRIP: AGE-ON-BYPASS If we want to use RRIP on direct-mapped DRAM caches, we have to solve two issues: how do we formulate RRIP as a bypassing policy suitable for caches with limited associativity, and how can we mitigate the state update cost of maintaining per-line reuse state in DRAM. 4.1 RRIP as a Bypassing Policy We design a version of RRIP for limite...

[5]

We can avoid state update costs if we have an effective way to infer an RRPV state

EFFICIENT TRACKING OF REUSE Demoting state on every cache bypass incurs significant bandwidth overheads: even if we choose to bypass the line, we still have to spend bandwidth to demote the replacement-state. We can avoid state update costs if we have an effective way to infer an RRPV state. Our design reduces the bandwidth consumed in performing updates...
[8]

Demotions only occur on first miss to a region

Figure 12: Design of Recent-Bypass-Table to enforce coordinated-bypass and coordinated-state-update. Demotions only occur on first miss to a region. Operation of ETR: On cache miss, we index into RBT with Region-ID. If there is an RBT miss, we are currently accessing the representative first-conflicting-set in a regi...
[9]

SIGNATURE-BASED POLICIES Thus far, we have discussed AOB and ETR only in the context of RRIP. However, AOB and ETR are actually general techniques that enable formulating direct-mapped versions of replacement policies, as well as reducing the bandwidth needed to maintain replacement policy state. AOB and ETR can make even state-of-the-art signature-based ...

[10]

A recent proposal ACCORD [34] tries to make DRAM caches set-associative, to improve hit rate albeit at an expense of bandwidth and latency [35,36,37]

TOWARDS SET-ASSOCIATIVE DESIGNS We evaluate our solutions in the context of a direct-mapped cache, but our designs and insights can be made applicable to set-associative caches. A recent proposal ACCORD [34] tries to make DRAM caches set-associative, to improve hit rate albeit at an expense of bandwidth and latency [35,36,37]. We compare with the recentl...
[11]

Due to space constraints, we limit these results to ETR implemented on RRIP-AOB

RESULTS AND DISCUSSION In this section we present sensitivity studies and storage analysis. Due to space constraints, we limit these results to ETR implemented on RRIP-AOB. 8.1 Multi-programmed Workloads To show robustness of our proposal to multi-programmed workloads, we evaluate over a larger set of 20 mix-application workloads. Figure 21 shows that ETR...

[12]

Probabilistic replacement policies [17, 43], become probabilistic bypass [8] in Figure 5

RELATED WORK 9.1 Replacement / Bypassing policies Recency-based replacement policies [16, 41, 42] install incoming lines at highest priority, which degenerate into always-install baseline. Probabilistic replacement policies [17, 43], become probabilistic bypass [8] in Figure 5. Frequency-based replacement [18, 19, 44, 45, 46] or Reuse-based replaceme...
[13]

We would like to use the most effective replacement policies to improve DRAM cache hit-rate

CONCLUSION This paper investigates improving hit-rate for direct-mapped DRAM caches by utilizing reuse-based replacement policies. We would like to use the most effective replacement policies to improve DRAM cache hit-rate. Unfortunately, state-of-the-art policies based on reuse are designed to compare multiple counter values within the set to decide a re...
[14]

High bandwidth memory (hbm) dram,

J. Standard, “High bandwidth memory (hbm) dram,” JESD235, 2013

[15]

JEDEC, DDR4 SPEC (JESD79-4), 2013

[16]

A revolutionary breakthrough in memory technology,

Intel and Micron, “A revolutionary breakthrough in memory technology,” 2015

[17]

Knights landing: Second-generation intel xeon phi product,

A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016
[18]

Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,

M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 235–246, Dec 2012

[19]

Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,

C. Chou, A. Jaleel, and M. K. Qureshi, “Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 198–210, ACM, 2015
[20]

Counter-based cache replacement and bypassing algorithms,

M. Kharbutli and Y. Solihin, “Counter-based cache replacement and bypassing algorithms,” IEEE Trans. Comput., vol. 57, pp. 433–447, Apr. 2008
[21]

A dueling segmented lru replacement algorithm with adaptive bypassing,

H. Gao and C. Wilkerson, “A dueling segmented lru replacement algorithm with adaptive bypassing,” in JWAC 2010-1st JILP Workshop on Computer Architecture Competitions: cache replacement Championship, 2010
[22]

High performance cache replacement using re-reference interval prediction (rrip),

A. Jaleel, K. B. Theobald, S. C. Steely, Jr., and J. Emer, “High performance cache replacement using re-reference interval prediction (rrip),” in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA ’10, (New York, NY, USA), pp. 60–71, ACM, 2010
[23]

Ship: Signature-based hit predictor for high performance caching,

C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer, “Ship: Signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 430–441, ACM, 2011
[24]

Ship++: Enhancing signature-based hit predictor for improved cache performance,

V. Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “Ship++: Enhancing signature-based hit predictor for improved cache performance,” in The 2nd Cache Replacement Championship (CRC-2 Workshop in ISCA 2017), 2017
[25]

Back to the future: Leveraging belady’s algorithm for improved cache replacement,

A. Jain and C. Lin, “Back to the future: Leveraging belady’s algorithm for improved cache replacement,” in 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), pp. 78–89, June 2016

[26]

Multiperspective reuse prediction,

D. A. Jiménez and E. Teran, “Multiperspective reuse prediction,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY, USA), pp. 436–448, ACM, 2017
[27]

Unison cache: A scalable and effective die-stacked dram cache,

D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A scalable and effective die-stacked dram cache,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 25–37, IEEE, 2014
[28]

Resilient die-stacked dram caches,

J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient die-stacked dram caches,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 416–427, ACM, 2013
[29]

Modified lru policies for improving second-level cache behavior,

W. A. Wong and J.-L. Baer, “Modified lru policies for improving second-level cache behavior,” in High-Performance Computer Architecture, 2000. HPCA-6. Proceedings. Sixth International Symposium on, pp. 49–60, IEEE, 2000
[30]

Adaptive insertion policies for high performance caching,

M. K. Qureshi, A. Jaleel, Y. N. Patt, S. C. Steely, and J. Emer, “Adaptive insertion policies for high performance caching,” in Proceedings of the 34th Annual International Symposium on Computer Architecture, ISCA ’07, (New York, NY, USA), pp. 381–391, ACM, 2007
[31]

Data cache management using frequency-based replacement,

J. T. Robinson and M. V. Devarakonda, “Data cache management using frequency-based replacement,” in Proceedings of the 1990 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, SIGMETRICS ’90, (New York, NY, USA), pp. 134–142, ACM, 1990
[32]

The v-way cache: demand-based associativity via global replacement,

M. K. Qureshi, D. Thompson, and Y. N. Patt, “The v-way cache: demand-based associativity via global replacement,” in Computer Architecture, 2005. ISCA’05. Proceedings. 32nd International Symposium on, pp. 544–555, IEEE, 2005
[33]

Usimm: the utah simulated memory module,

N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “Usimm: the utah simulated memory module,” University of Utah, Tech. Rep, 2012
[34]

Knights landing: Second-generation intel xeon phi product,

A. Sodani, R. Gramunt, J. Corbal, H. S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y. C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016
[35]

Basic performance measurements of the intel optane DC persistent memory module,

J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic performance measurements of the intel optane DC persistent memory module,” CoRR, vol. abs/1903.05714, 2019
[36]

Fact sheet: New intel architectures and technologies target expanded market opportunities,

Intel, “Fact sheet: New intel architectures and technologies target expanded market opportunities,” 2018. Accessed: 2019-03-20

[37]

Phase change memory: From devices to systems,

M. K. Qureshi, S. Gurumurthi, and B. Rajendran, “Phase change memory: From devices to systems,” Synthesis Lectures on Computer Architecture, vol. 6, no. 4, pp. 1–134, 2011

[38]

A 20nm 1.8v 8gb pram with 40mb/s program bandwidth,

Y. Choi, I. Song, M.-H. Park, H. Chung, S. Chang, B. Cho, J. Kim, Y. Oh, D. Kwon, J. Sunwoo, J. Shin, Y. Rho, C. Lee, M.-G. Kang, J. Lee, Y. Kwon, S. Kim, J. Kim, Y.-J. Lee, Q. Wang, S. Cha, S. Ahn, H. Horii, J. Lee, K. Kim, H. Joo, K. Lee, Y.-T. Lee, J. Yoo, and G. Jeong, “A 20nm 1.8v 8gb pram with 40mb/s program bandwidth,” in Solid-State Circuits...
[39]

Phase change memory,

H. S. P. Wong, S. Raoux, S. Kim, J. Liang, J. P. Reifenberg, B. Rajendran, M. Asheghi, and K. E. Goodson, “Phase change memory,” Proceedings of the IEEE, vol. 98, pp. 2201–2227, Dec 2010

[40]

Architecting phase change memory as a scalable dram alternative,

B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting phase change memory as a scalable dram alternative,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, (New York, NY, USA), pp. 2–13, ACM, 2009
[41]

Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,

H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,” in Microarchitecture, 2004. MICRO-37 2004. 37th International Symposium on, pp. 81–92, Dec 2004

[42]

Spec cpu2006 benchmark descriptions,

J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, pp. 1–17, Sept. 2006

[43]

The GAP Benchmark Suite

S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP benchmark suite,” CoRR, vol. abs/1508.03619, 2015

[44]

Spatial memory streaming,

S. Somogyi, T. F. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos, “Spatial memory streaming,” in Proceedings of the 33rd Annual International Symposium on Computer Architecture, ISCA ’06, (Washington, DC, USA), pp. 252–263, IEEE Computer Society, 2006
[45]

Sampling dead block prediction for last-level caches,

S. M. Khan, Y. Tian, and D. A. Jimenez, “Sampling dead block prediction for last-level caches,” in Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO ’43, (Washington, DC, USA), pp. 175–186, IEEE Computer Society, 2010
[46]

Rethinking belady’s algorithm to accommodate prefetching,

A. Jain and C. Lin, “Rethinking belady’s algorithm to accommodate prefetching,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), June 2018

[47]

Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,

V. Young, C. Chou, A. Jaleel, and M. K. Qureshi, “Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 328–339, June 2018
[48]

Column-associative caches: A technique for reducing the miss rate of direct-mapped caches,

A. Agarwal and S. D. Pudar, Column-associative caches: A technique for reducing the miss rate of direct-mapped caches, vol. 21. ACM, 1993
[49]

Predictive sequential associative cache,

B. Calder, D. Grunwald, and J. Emer, “Predictive sequential associative cache,” in Proceedings of the 2nd IEEE Symposium on High-Performance Computer Architecture, HPCA ’96, (Washington, DC, USA), pp. 244–, IEEE Computer Society, 1996
[50]

D. H. Albonesi, “Selective cache ways: On-demand cache resource allocation,” in Microarchitecture, 1999. MICRO-32. Proceedings. 32nd Annual International Symposium on, pp. 248–259, IEEE, 1999
[51]

K. Chandrasekar, C. Weis, B. Akesson, N. Wehn, and K. Goossens, “System and circuit level power modeling of energy-efficient 3D-stacked wide I/O DRAMs,” in Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’13, (San Jose, CA, USA), pp. 236–241, EDA Consortium, 2013
[52]

K. T. Malladi, I. Shaeffer, L. Gopalakrishnan, D. Lo, B. C. Lee, and M. Horowitz, “Rethinking DRAM power modes for energy proportionality,” in Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-45, (Washington, DC, USA), pp. 131–142, IEEE Computer Society, 2012
[53]

J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and scalable hybrid memories using fine-granularity DRAM cache management,” IEEE Computer Architecture Letters, vol. 11, pp. 61–64, July 2012
[54]

D. A. Jiménez, “Insertion and promotion for tree-based PseudoLRU last-level caches,” in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 284–296, ACM, 2013
[55]

Y. Smaragdakis, S. Kaplan, and P. Wilson, “EELRU: Simple and effective adaptive page replacement,” in ACM SIGMETRICS Performance Evaluation Review, vol. 27, pp. 122–133, ACM, 1999
[56]

A. Jaleel, W. Hasenplaugh, M. Qureshi, J. Sebot, S. Steely, Jr., and J. Emer, “Adaptive insertion policies for managing shared caches,” in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, PACT ’08, (New York, NY, USA), pp. 208–219, ACM, 2008
[57]

E. G. Hallnor and S. K. Reinhardt, “A fully associative software-managed cache design,” in Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA ’00, (New York, NY, USA), pp. 107–116, ACM, 2000
[58]

E. J. O’Neil, P. E. O’Neil, and G. Weikum, “The LRU-K page replacement algorithm for database disk buffering,” in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, SIGMOD ’93, (New York, NY, USA), pp. 297–306, ACM, 1993
[59]

D. Lee, J. Choi, J. H. Kim, S. H. Noh, S. L. Min, Y. Cho, and C. S. Kim, “LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies,” IEEE Trans. Comput., vol. 50, pp. 1352–1361, Dec. 2001
[60]

N. Duong, D. Zhao, T. Kim, R. Cammarota, M. Valero, and A. V. Veidenbaum, “Improving cache management policies using dynamic reuse distances,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, pp. 389–400, IEEE, 2012
[61]

G. Keramidas, P. Petoumenos, and S. Kaxiras, “Cache replacement based on reuse-distance prediction,” in Computer Design, 2007. ICCD 2007. 25th International Conference on, pp. 245–250, IEEE, 2007
[63]

C. Chou, A. Jaleel, and M. K. Qureshi, “CANDY: Enabling coherent DRAM caches for multi-node systems,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, Oct 2016
[64]

G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large die-stacked DRAM caches,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 454–464, ACM, 2011
[65]

C.-C. Huang and V. Nagarajan, “ATCache: Reducing DRAM cache latency via a small SRAM tag cache,” in Proceedings of the 23rd International Conference on Parallel Architectures and Compilation, pp. 51–60, ACM, 2014
[66]

Z. Wang, D. A. Jiménez, T. Zhang, G. H. Loh, and Y. Xie, “Building a low latency, highly associative DRAM cache with the buffered way predictor,” in 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 109–117, Oct 2016
[67]

D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked DRAM caches for servers: Hit ratio, latency, or bandwidth? Have it all with footprint cache,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 404–415, ACM, 2013
[68]

Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A fully associative, tagless DRAM cache,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 211–222, ACM, 2015
[69]

H. Jang, Y. Lee, J. Kim, Y. Kim, J. Kim, J. Jeong, and J. W. Lee, “Efficient footprint caching for tagless DRAM caches,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 237–248, IEEE, 2016
[70]

G. H. Loh, N. Jayasena, J. Chung, S. K. Reinhardt, M. O’Connor, and K. McGrath, “Challenges in heterogeneous die-stacked and off-chip memory systems,” in 3rd Workshop on SoCs, Heterogeneous Architectures and Workloads (SHAW-3), Feb. 2012
[71]

X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-efficient DRAM caching via software/hardware cooperation,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY, USA), pp. 1–14, ACM, 2017


