TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and Misses in Hybrid Memory Systems

Moinuddin K. Qureshi; Vinson Young; Zeshan Chishti

arxiv: 1907.02184 · v1 · pith:QNEINOARnew · submitted 2019-07-04 · 💻 cs.AR


Pith reviewed 2026-05-25 09:12 UTC · model grok-4.3

classification 💻 cs.AR

keywords: DRAM cache, hybrid memory, 3D-XPoint, tag organization, bandwidth reduction, dirty bit, cache metadata


The pith

TicToc combines tag-inside and tag-outside DRAM cache organizations to deliver low hit latency and low miss bandwidth with only 34KB SRAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a DRAM cache placed in front of slower 3D-XPoint memory can be made to serve both hits and misses efficiently by keeping both per-line tags inside data blocks and grouped tags outside them. Naively merging the two organizations increases bandwidth traffic for metadata updates, so the authors introduce a dirtiness bit sent to the last-level cache and a prediction step that marks lines dirty when they are first installed. With these changes the design yields a 10 percent speedup over a hit-optimized baseline while approaching the 14 percent gain of an idealized cache that would need 64MB of SRAM tags. A reader would care because high-capacity non-volatile memory only becomes practical if its access penalties can be hidden by a small, fast DRAM layer without exhausting memory bandwidth.

Core claim

TicToc provisions both TIC and TOC metadata inside the same DRAM cache. The dominant bandwidth cost comes from repeated dirty-bit checks for the TOC structure; this cost is reduced by carrying a DRAM Cache Dirtiness Bit to the last-level cache so known-dirty lines skip further checks, and by Preemptive Dirty Marking that sets the bit at install time for lines predicted to be written soon. On a 4GB DRAM cache backed by 3D-XPoint these changes produce a 10 percent speedup over baseline TIC while using only 34KB of SRAM.
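To make the two dirty-bit optimizations concrete, the toy model below counts TOC dirty-bit accesses for a short event trace. It is a sketch under stated assumptions, not the paper's mechanism: the event names, the cost of one read plus one write for a dirty-bit check and update, and the idea that a Preemptive Dirty Marking prediction is folded into the install at no extra cost are all illustrative choices. Misprediction costs are not modeled.

```python
def toc_dirty_accesses(trace, use_dcdb=False, predicted_dirty=frozenset()):
    """Count TOC dirty-bit accesses for a trace of (op, line) events.

    op is "install" (line brought into the DRAM cache) or "writeback"
    (the LLC writes the line back to the DRAM cache). Toy model only.
    """
    toc_dirty = set()    # lines whose TOC dirty bit is currently set
    known_dirty = set()  # lines the LLC knows are dirty (DCDB set)
    accesses = 0
    for op, line in trace:
        if op == "install":
            toc_dirty.discard(line)
            known_dirty.discard(line)
            if line in predicted_dirty:
                # Assumed PDM behavior: the dirty bit is set as part of the
                # install, so no separate metadata access is counted.
                toc_dirty.add(line)
                if use_dcdb:
                    known_dirty.add(line)
        elif op == "writeback":
            if line in known_dirty:
                continue  # DCDB says the line is already dirty: skip the check
            accesses += 1  # read the TOC dirty bit
            if line not in toc_dirty:
                accesses += 1  # write the TOC dirty bit
                toc_dirty.add(line)
            if use_dcdb:
                known_dirty.add(line)
        else:
            raise ValueError(f"unknown op: {op}")
    return accesses


# Hypothetical trace: one install followed by three writebacks of the same line.
trace = [("install", "A"), ("writeback", "A"), ("writeback", "A"), ("writeback", "A")]
print(toc_dirty_accesses(trace))                                         # 4
print(toc_dirty_accesses(trace, use_dcdb=True))                          # 2
print(toc_dirty_accesses(trace, use_dcdb=True, predicted_dirty={"A"}))   # 0
```

In this model the Dirtiness Bit removes the repeated checks and the preemptive marking removes the initial update, which mirrors the "prune" and "amortize" roles described above.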

What carries the argument

DRAM Cache Dirtiness Bit propagated to the last-level cache together with Preemptive Dirty Marking at install time; these two mechanisms prune and amortize the dominant TOC dirty-bit traffic that otherwise negates the benefit of the combined organization.

If this is right

A 4GB DRAM cache can reach within four percentage points of the performance of an idealized cache that stores tags in 64MB of SRAM.
The entire metadata scheme fits in 34KB of SRAM.
Metadata traffic on the memory bus decreases because fewer tag and dirty-bit accesses are required, which leaves more bandwidth for data transfers.
The same DRAM cache now optimizes for both hits and misses instead of trading one for the other.
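The storage figures in this list can be checked with a few lines of arithmetic. The sketch below assumes binary units (GiB, MiB, KiB) and the 64B line size quoted from the paper's methodology further down this page; the variable names are illustrative.

```python
GIB, MIB, KIB = 1 << 30, 1 << 20, 1 << 10

cache_bytes = 4 * GIB      # DRAM cache capacity
line_bytes = 64            # line size (from the methodology excerpt)
ideal_sram_bytes = 64 * MIB
tictoc_sram_bytes = 34 * KIB

lines = cache_bytes // line_bytes                   # 67,108,864 lines (2**26)
ideal_bytes_per_line = ideal_sram_bytes / lines     # 1.0 byte of SRAM tag per line
sram_reduction = ideal_sram_bytes / tictoc_sram_bytes  # roughly 1928x less SRAM

gap_points = 14 - 10       # idealized speedup minus TicToc speedup, in points

print(lines, ideal_bytes_per_line, round(sram_reduction), gap_points)
```

Under these assumptions the idealized design spends about one byte of SRAM per cached line, which is where the 64MB figure comes from.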

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dirtiness-bit propagation idea could be applied to other cache metadata that must stay consistent across hierarchy levels.
Prediction accuracy for preemptive marking may improve if it incorporates program-counter history rather than a simple heuristic.
The approach may extend to other non-volatile memories that also exhibit read/write asymmetry and high access latency.
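As an illustration of what a program-counter-indexed write predictor could look like, here is a minimal sketch in the general style of signature-based predictors. The table size, counter width, hash, and training rule are all hypothetical choices made for this example; the page does not give the structure of the paper's own predictor.

```python
class WritePredictor:
    """Table of saturating counters indexed by a hashed PC (illustrative)."""

    def __init__(self, entries=256, counter_bits=2):
        self.entries = entries
        self.max_count = (1 << counter_bits) - 1
        self.threshold = (self.max_count + 1) // 2
        self.table = [0] * entries

    def _index(self, pc):
        # Hypothetical hash: fold upper PC bits into the lower ones.
        return (pc ^ (pc >> 8)) % self.entries

    def predict_written(self, pc):
        """Return True if lines installed by this PC are expected to be written."""
        return self.table[self._index(pc)] >= self.threshold

    def train(self, pc, was_written):
        """Update on eviction: was the line written while it was resident?"""
        i = self._index(pc)
        if was_written:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


predictor = WritePredictor()
pc = 0x400123  # hypothetical instruction address
predictor.train(pc, was_written=True)
predictor.train(pc, was_written=True)
print(predictor.predict_written(pc))  # True
```

A line installed by a PC whose counter is at or above the threshold would be the kind of line that preemptive marking flags as dirty at install time.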

Load-bearing premise

The bandwidth saved by avoiding repeated dirty-bit traffic and initial updates will exceed any new overhead introduced by the extra bit and the prediction logic.

What would settle it

Run the paper's workloads on a simulator with the dirtiness bit and preemptive marking disabled and then enabled, and check whether total DRAM cache bandwidth falls and whether the overall speedup over baseline TIC matches the reported 10 percent.
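A small helper for summarizing such an ablation might look like the sketch below. The function names and the dictionary layout are hypothetical, and the use of a geometric mean across workloads is an assumption, since this page does not say how the paper averages its results.

```python
import math


def speedup(baseline_cycles, new_cycles):
    """Speedup relative to the baseline (values above 1 mean faster)."""
    if baseline_cycles <= 0 or new_cycles <= 0:
        raise ValueError("cycle counts must be positive")
    return baseline_cycles / new_cycles


def geomean(values):
    """Geometric mean of positive values."""
    values = list(values)
    if not values:
        raise ValueError("need at least one value")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def ablation_summary(baseline, variants):
    """Mean speedup per variant.

    baseline: {workload: cycles} for the TIC baseline.
    variants: {variant_name: {workload: cycles}} for each configuration.
    """
    return {
        name: geomean(speedup(baseline[w], cycles[w]) for w in baseline)
        for name, cycles in variants.items()
    }
```

Passing per-workload cycle counts for the TIC baseline and for each TicToc variant returns one mean speedup per variant, which can then be compared against the reported figure.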

Figures

Figures reproduced from arXiv: 1907.02184 by Moinuddin K. Qureshi, Vinson Young, Zeshan Chishti.

**Figure 2.** DRAM cache organization and flow for (a) idealized Tag-In-SRAM, (b) hit-latency-optimized Tag-Inside ...

**Figure 3.** TicToc Metadata Organization queries hit/miss predictor to use TIC metadata for hits and TOC metadata ...

**Figure 4.** Breakdown of bus bandwidth consumption for ...

**Figure 5.** Breakdown of bus bandwidth consumption for ...

**Figure 6.** Bandwidth for a typical (a) write path and (b) miss+install path. TicToc+PDM adds “Predicted-Dirty” state, ...

**Figure 7.** Speedup of TOC, proposed TicToc, TicToc with DRAM Cache Dirtiness bit, TicToc with Preemptive Dirty ...

**Figure 8.** Signature-based Write Predictor learns which ...

**Figure 10.** Breakdown of bus bandwidth for dirty-optimized TicToc. Dirty-bit updates are greatly reduced.

**Figure 11.** Breakdown of bus bandwidth for dirty-optimized TicToc w/ Write-Aware Bypassing. Installs are mitigated.

**Figure 12.** Write-Aware Bypass. Reduce install band...

**Figure 13.** Speedup of a no-DRAM-cache configuration, proposed TicToc organization, adding 90%-bypass, adding ...

**Figure 14.** Speedup of Channel-Shared Hybrid Memory, over Dedicated-Channel Hybrid Memory. Channel-sharing enables up to 40% speedup.

**Figure 15.** Speedup of TicToc with dirty-bit optimiza...

**Figure 16.** Speedup of TicToc (dirty-opt, bypassing) and ...

Original abstract

This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is becoming a viable alternative to DRAM as it enables high-capacity and non-volatile main memory systems; however, 3D-XPoint has 4-8x slower read, and worse writes. As such, effective DRAM caching in front of 3D-XPoint is important to enable a high-capacity, low-latency, and high-write-bandwidth memory. There are two major approaches for DRAM cache design: (1) a Tag-Inside-Cacheline (TIC) organization that optimizes for hits, by storing tag next to each line such that one access gets both tag and data, and (2) a Tag-Outside-Cacheline (TOC) organization that optimizes for misses, by storing tags from multiple data-lines together such that one tag-access gets info for several data-lines. Ideally, we desire the low hit-latency of TIC, and the low miss-bandwidth of TOC. To this end, we propose TicToc, an organization that provisions both TIC and TOC to get hit and miss benefits of both. However, we find that naively combining both actually performs worse than TIC, because one needs to pay bandwidth to maintain both metadata. The main contribution of this work is developing architectural techniques to reduce the bandwidth of maintaining both TIC and TOC metadata. We find the majority of the bandwidth cost is due to maintaining TOC dirty bits. We propose DRAM Cache Dirtiness Bit, which carries DRAM cache dirty info to last-level caches, to prune repeated dirty-bit checks for known dirty lines. We then propose Preemptive Dirty Marking, which predicts which lines will be written and proactively marks dirty bit at install time, to amortize the initial dirty-bit update. Our evaluations on a 4GB DRAM cache with 3D-XPoint memory show that TicToc enables 10% speedup over baseline TIC, nearing 14% speedup possible with an idealized DRAM cache w/ 64MB of SRAM tags, while needing only 34KB SRAM.
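The hit and miss trade-off described in the abstract can be sketched as a simple access-count model. This is an illustration under stated assumptions, not the paper's evaluation: the class and function names are hypothetical, a TOC tag access is assumed to be fully amortized over the lines it covers, and the combined case assumes a perfect hit/miss predictor with free metadata maintenance, which the abstract says does not hold for a naive combination.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessCost:
    """DRAM-cache accesses needed to resolve one request (toy model)."""
    hit: float
    miss: float


def tic_cost() -> AccessCost:
    # TIC: the tag sits next to the data, so one access serves a hit.
    # A miss still costs one access just to read the mismatching tag.
    return AccessCost(hit=1.0, miss=1.0)


def toc_cost(lines_per_tag_access: int) -> AccessCost:
    # TOC: one tag access covers several lines (assumed fully amortized),
    # but a hit needs a separate data access.
    if lines_per_tag_access < 1:
        raise ValueError("lines_per_tag_access must be at least 1")
    tag = 1.0 / lines_per_tag_access
    return AccessCost(hit=tag + 1.0, miss=tag)


def combined_cost(lines_per_tag_access: int) -> AccessCost:
    # Idealized combination: TIC path for hits, TOC path for misses.
    return AccessCost(hit=tic_cost().hit, miss=toc_cost(lines_per_tag_access).miss)


def expected_accesses(cost: AccessCost, hit_rate: float) -> float:
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * cost.hit + (1.0 - hit_rate) * cost.miss


# Hypothetical parameters: 8 lines per tag access, 50% hit rate.
for name, cost in [("TIC", tic_cost()), ("TOC", toc_cost(8)), ("combined", combined_cost(8))]:
    print(name, expected_accesses(cost, 0.5))
```

The model deliberately omits the cost of keeping both sets of metadata up to date, which is the overhead the paper's dirty-bit techniques target.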

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TicToc pairs TIC and TOC DRAM cache organizations and adds two dirty-bit techniques to cut metadata bandwidth enough for a net 10% gain over plain TIC.


TicToc shows a way to run both a hit-optimized and a miss-optimized DRAM cache organization at the same time while keeping the extra metadata bandwidth low enough to still win overall. The two new techniques for handling dirty bits are the key. The paper starts from the observation that TIC is good for hits but bad for misses, and TOC is the opposite. Combining them naively hurts because of the cost to keep both sets of tags and dirty bits up to date. The authors focus on the dirty-bit traffic as the main problem and move some state into the LLC with the Dirtiness Bit, plus use prediction to mark lines dirty ahead of time. This gets them to 10% better than plain TIC and close to what you'd get with huge SRAM tags.

What works here is the low hardware cost and the targeted fix for the bandwidth issue. 34KB SRAM is modest, and the idea of pruning repeated checks makes sense for workloads where lines stay dirty for a while.

The soft spot is the lack of detail on how much bandwidth the new mechanisms add back in. The Dirtiness Bit requires changes to the LLC and extra bits on the bus, and the predictor can mispredict. The abstract treats the net as positive, but without numbers on the incremental costs versus the savings, it's hard to judge if the overheads are truly small. The simulation results are given, but details on workloads and error bars aren't in the abstract.

This paper is for architects working on hybrid DRAM-NVM systems. A reader who cares about practical DRAM cache designs will find the concrete numbers and the two techniques useful to build on. The thinking is clear and it cites the relevant prior work on TIC and TOC. It deserves peer review because the problem is real and the solution is implementable with low cost.

Referee Report

2 major / 2 minor

Summary. The paper proposes TicToc, a DRAM cache design for hybrid DRAM + 3D-XPoint systems that combines Tag-Inside-Cacheline (TIC) and Tag-Outside-Cacheline (TOC) organizations to achieve both low hit latency and low miss bandwidth. It shows that a naive combination performs worse than TIC alone due to extra metadata maintenance bandwidth, primarily from TOC dirty bits. The main contributions are two techniques—DRAM Cache Dirtiness Bit (which propagates dirty state to the LLC) and Preemptive Dirty Marking (a predictor that marks lines dirty at install time)—to reduce this cost. On a 4GB DRAM cache, TicToc delivers 10% speedup over baseline TIC (approaching the 14% of an idealized 64MB-SRAM-tag cache) while using only 34KB SRAM.

Significance. If the net bandwidth savings from the proposed techniques are confirmed to outweigh their added LLC and predictor overheads, the result would be a practical advance in hybrid memory caching: it closes most of the gap to an idealized tag store without requiring large on-chip SRAM. The work directly addresses a well-known TIC/TOC trade-off and supplies concrete, low-overhead mechanisms that could be adopted in future non-volatile memory controllers.

major comments (2)

[evaluation / techniques section] §4 (or the evaluation section describing the two techniques): the central 10% speedup claim rests on the assertion that DRAM Cache Dirtiness Bit plus Preemptive Dirty Marking produce net bandwidth savings that exceed the extra TOC dirty-bit traffic plus any new LLC bandwidth or predictor-misprediction writes. No table or figure quantifies the incremental LLC-to-memory traffic or misprediction-induced writes against the reported savings; without this breakdown the headline delta cannot be verified.
[results / idealized baseline] Table or figure reporting the 10% and 14% speedups: the comparison to the idealized DRAM cache assumes 64 MB SRAM tags, yet the paper does not state whether the idealized model also includes the same LLC and memory-controller constraints that TicToc must satisfy; this makes the proximity claim difficult to interpret.

minor comments (2)

[abstract] Abstract states performance numbers but supplies no workload names, simulation parameters, or error bars; the full manuscript should make these explicit in the evaluation section.
[techniques description] The description of Preemptive Dirty Marking does not specify the predictor structure or training method; a short paragraph or diagram would clarify the 34 KB SRAM budget allocation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and verifiability of the results.


Referee: [evaluation / techniques section] §4 (or the evaluation section describing the two techniques): the central 10% speedup claim rests on the assertion that DRAM Cache Dirtiness Bit plus Preemptive Dirty Marking produce net bandwidth savings that exceed the extra TOC dirty-bit traffic plus any new LLC bandwidth or predictor-misprediction writes. No table or figure quantifies the incremental LLC-to-memory traffic or misprediction-induced writes against the reported savings; without this breakdown the headline delta cannot be verified.

Authors: We agree that an explicit breakdown would strengthen the paper. The current results demonstrate the net performance benefit through end-to-end simulation, but do not isolate the incremental LLC-to-DRAM and misprediction traffic components. In revision we will add a new figure (or table) in Section 4 that reports these incremental bandwidth costs for each technique relative to the baseline TIC organization, allowing direct verification that the savings exceed the added overheads. revision: yes
Referee: [results / idealized baseline] Table or figure reporting the 10% and 14% speedups: the comparison to the idealized DRAM cache assumes 64 MB SRAM tags, yet the paper does not state whether the idealized model also includes the same LLC and memory-controller constraints that TicToc must satisfy; this makes the proximity claim difficult to interpret.

Authors: The idealized DRAM cache (64 MB SRAM tags) is modeled under the same system constraints as TicToc, including the same LLC size, replacement policy, memory controller, and 3D-XPoint timing parameters; only the tag storage is idealized, with all tags held in SRAM. We will revise the text in the results section and caption to explicitly state these modeling assumptions so the 14% figure is directly comparable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; simulation-based architectural proposal


The paper proposes TicToc as an architectural organization for DRAM caches and evaluates performance via simulation on a 4GB DRAM cache setup. No equations, fitted parameters, or derivation chains exist that could reduce claims to inputs by construction. Central results (10% speedup) are reported from direct simulation comparisons to TIC baseline and idealized cases, not from self-definitional metadata or self-citation load-bearing steps. This matches the default expectation of no circularity for non-derivational work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or fitted constants appear in the abstract; the contribution is an engineering design rather than a derivation resting on free parameters or new axioms.



Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 2 internal anchors

[1] INTRODUCTION. As memory systems scale, non-volatile memories or NVMs (such as 3D-XPoint [1]) are emerging as viable alternatives to DRAM. NVMs offer the advantages of higher bit density and the ability to retain data after power outages. However, NVMs also have significant limitations that prevent them from outright replacing DRAM in the memory hierarchy. ...
[2] BACKGROUND AND MOTIVATION. DRAM caches are important for enabling heterogeneous memory systems to have the effective latency and bandwidth of one memory technology, and the capacity of another; however, there are several challenges in designing DRAM caches. A DRAM cache design has to balance multiple goals. First, it should minimize the SRAM storage nee...
[3] METHODOLOGY. 3.1 Framework and Configuration. We use USIMM [20], an x86 simulator with detailed memory system model. We extend USIMM to include a DRAM cache. Table 2 shows the configuration used in our study. We assume a four-level cache hierarchy (L1, L2, L3 being on-chip SRAM caches and L4 being off-chip DRAM cache). All caches use 64B line size. We mode...
[4] TICTOC DESIGN. DRAM caches need metadata to confirm if a line is cache resident or not (tag bits), and if the resident line is the most up-to-date copy (dirty bit). Tag-Inside-Cacheline (TIC) organizations are optimized for hits as one access gets both metadata and data, but can suffer for misses as misses still need to access DRAM for metadata. In contra...
[5] REDUCING INSTALL BANDWIDTH WITH WRITE-AWARE BYPASS. When data has poor reuse, installing lines and updating TOC metadata wastes bandwidth. In fact, in such cases, employing a DRAM cache could actually hurt performance, as the line install and tag maintenance operations needlessly steal bus bandwidth from memory accesses. Figure 13 shows the performance...
[6] RESULTS AND DISCUSSION. In this section we present sensitivity studies and storage analysis. Due to space constraints, we limit results to TicToc with dirty-bit optimizations. 6.1 Storage Requirements. We analyze the SRAM storage requirements of our TicToc organization. TicToc requires structures from its component TIC and TOC organizations. Inheriting from...
[7] RELATED WORK. 7.1 Line-based DRAM Caches. In our work, we utilize and combine the two major types of line-granularity DRAM cache designs: Tag-Inside-Cacheline (TIC) and Tag-Outside-Cacheline (TOC) approaches. TIC designs [7, 11, 18, 30, 31, 32] organize their cache as direct-mapped and store tag inside the cacheline, such that one access can retrieve both t...
[8] CONCLUSION. This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. Effective DRAM caching in front of 3D-XPoint is critical to enabling a memory system that has the apparent high-capacity of 3D-XPoint, and the low-latency and high-write-bandwidth of DRAM. There are two currently major approaches for DRAM cache desig...
[9] Intel and Micron, “A revolutionary breakthrough in memory technology,” 2015
[10] J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic performance measurements of the intel optane DC persistent memory module,” CoRR, vol. abs/1903.05714, 2019
[11] A. Ilkbahar, “Intel© optane™ dc persistent memory operating modes explained,” 2018. Accessed: 2019-03-20
[12] M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system using phase-change memory technology,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, (New York, NY, USA), pp. 24–33, ACM, 2009
[13] G. Dhiman, R. Ayoub, and T. Rosing, “Pdram: A hybrid pram and dram main memory system,” in 2009 46th ACM/IEEE Design Automation Conference, pp. 664–669, July 2009
[14] A. Bivens, P. Dube, M. Franceschini, J. Karidis, L. Lastras, and M. Tsao, “Architectural design for next generation heterogeneous memory systems,” in Memory Workshop (IMW), 2010 IEEE International, pp. 1–4, IEEE, 2010
[15] M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 235–246, Dec 2012
[16] J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,” IEEE Computer Architecture Letters, vol. 11, pp. 61–64, July 2012
[17] G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large die-stacked dram caches,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 454–464, ACM, 2011
[18] D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 404–415, ACM, 2013
[19] A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016
[20] J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient die-stacked dram caches,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 416–427, ACM, 2013
[21] D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A scalable and effective die-stacked dram cache,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 25–37, IEEE, 2014
[22] JEDEC Standard, “High bandwidth memory (hbm) dram,” JESD235, 2013
[23] JEDEC, DDR4 SPEC (JESD79-4), 2013
[24] ArsTechnica, “Intel’s crazy-fast 3d xpoint optane memory heads for ddr slots (but with a catch),” 2018. Accessed: 2019-01-23
[25] M. Arafa, B. Fahim, S. Kottapalli, A. Kumar, L. P. Looi, S. Mandava, A. Rudoff, I. M. Steiner, B. Valentine, G. Vedaraman, and S. Vora, “Cascade lake: Next generation intel xeon scalable processor,” IEEE Micro, vol. 39, pp. 29–36, March 2019
[26] C. Chou, A. Jaleel, and M. K. Qureshi, “Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 198–210, ACM, 2015
[27] J. B. Rothman and A. J. Smith, “Sector cache design and performance,” in Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728), pp. 124–133, Aug 2000
[28] N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “Usimm: the utah simulated memory module,” University of Utah, Tech. Rep, 2012
[29] Intel, “Fact sheet: New intel architectures and technologies target expanded market opportunities,” 2018. Accessed: 2019-03-20
[30] H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,” in Microarchitecture, 2004. MICRO-37 2004. 37th International Symposium on, pp. 81–92, Dec 2004
[31] J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, pp. 1–17, Sept. 2006
[32] S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP benchmark suite,” CoRR, vol. abs/1508.03619, 2015
[33] J. Sim, G. H. Loh, H. Kim, M. O’Connor, and M. Thottethodi, “A mostly-clean dram cache for effective hit speculation and self-balancing dispatch,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, pp. 247–257, IEEE, 2012
[34] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer, “Ship: Signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 430–441, ACM, 2011
[35] V. Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “Ship++: Enhancing signature-based hit predictor for improved cache performance,” in The 2nd Cache Replacement Championship (CRC-2 Workshop in ISCA 2017), 2017
[36] M. Kharbutli and Y. Solihin, “Counter-based cache replacement and bypassing algorithms,” IEEE Trans. Comput., vol. 57, pp. 433–447, Apr. 2008
[37] H. Gao and C. Wilkerson, “A dueling segmented lru replacement algorithm with adaptive bypassing,” in JWAC 2010-1st JILP Workshop on Computer Architecture Competitions: cache replacement Championship, 2010
[38] V. Young, C. Chou, A. Jaleel, and M. K. Qureshi, “Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 328–339, June 2018
[39] C. Chou, A. Jaleel, and M. K. Qureshi, “Candy: Enabling coherent dram caches for multi-node systems,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, Oct 2016
[40] V. Young, P. J. Nair, and M. K. Qureshi, “Dice: Compressing dram caches for bandwidth and capacity,” in ISCA ’17, (New York, NY, USA), pp. 627–638, ACM, 2017
[41] C.-C. Huang and V. Nagarajan, “Atcache: reducing dram cache latency via a small sram tag cache,” in Proceedings of the 23rd international conference on Parallel architectures and compilation, pp. 51–60, ACM, 2014
[42] Z. Wang, D. A. Jiménez, T. Zhang, G. H. Loh, and Y. Xie, “Building a low latency, highly associative dram cache with the buffered way predictor,” in 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 109–117, Oct 2016
[43] Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A fully associative, tagless dram cache,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 211–222, ACM, 2015
[44] H. Jang, Y. Lee, J. Kim, Y. Kim, J. Kim, J. Jeong, and J. W. Lee, “Efficient footprint caching for tagless dram caches,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 237–248, IEEE, 2016
[45] G. H. Loh, N. Jayasena, J. Chung, S. K. Reinhardt, M. O’Connor, and K. McGrath, “Challenges in heterogeneous die-stacked and off-chip memory systems,” in 3rd Workshop on SoCs, Heterogeneous Architectures and Workloads (SHAW-3), Feb. 2012
[46] X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-efficient dram caching via software/hardware cooperation,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY, USA), pp. 1–14, ACM, 2017
[47] C. Chou, A. Jaleel, and M. K. Qureshi, “Cameo: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, (Washington, DC, USA), pp. 1–12, IEEE Computer Society, 2014
[48]

Transparent hardware management of stacked dram as part of memory,

J. Sim, A. R. Alameldeen, Z. Chishti, C. Wilkerson, and H. Kim, “Transparent hardware management of stacked dram as part of memory,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, (Washington, DC, USA), pp. 13–24, IEEE Computer Society, 2014

[49]

Silc-fm: Subblocked interleaved cache-like flat memory organization,

J. H. Ryoo, M. R. Meswani, A. Prodromou, and L. K. John, “Silc-fm: Subblocked interleaved cache-like flat memory organization,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 349–360, Feb 2017

[50]

Mempod: A clustered architecture for efficient and scalable migration in flat address space multi-level memories,

A. Prodromou, M. Meswani, N. Jayasena, G. Loh, and D. M. Tullsen, “Mempod: A clustered architecture for efficient and scalable migration in flat address space multi-level memories,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 433–444, Feb 2017

[51]

Pageseer: Using page walks to trigger page swaps in hybrid memory systems,

A. Kokolis, “Pageseer: Using page walks to trigger page swaps in hybrid memory systems,” 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 596–608, 2019

[52]

C3d: Mitigating the numa bottleneck via coherent dram caches,

C. Huang, R. Kumar, M. Elver, B. Grot, and V. Nagarajan, “C3d: Mitigating the numa bottleneck via coherent dram caches,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, Oct 2016

[53]

Cache coherence for GPU architectures,

I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt, “Cache coherence for GPU architectures,” in 19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 23-27, 2013

[54]

Combining hw/sw mechanisms to improve numa performance of multi-gpu systems,

V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, and O. Villa, “Combining hw/sw mechanisms to improve numa performance of multi-gpu systems,” in MICRO ’18, October 2018.


[1]

TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and Misses in Hybrid Memory Systems

INTRODUCTION As memory systems scale, non-volatile memories or NVMs (such as 3D-XPoint [1]) are emerging as viable alternatives to DRAM. NVMs offer the advantages of higher bit density and the ability to retain data after power outages. However, NVMs also have significant limitations that prevent them from outright replacing DRAM in the memory hierarchy. ...

[2]

A DRAM cache design has to balance multiple goals

BACKGROUND AND MOTIVATION DRAM caches are important for enabling heterogeneous memory systems to have the effective latency and bandwidth of one memory technology, and the capacity of another; however, there are several challenges in designing DRAM caches. A DRAM cache design has to balance multiple goals. First, it should minimize the SRAM storage nee...

[3]

We extend USIMM to include a DRAM cache

METHODOLOGY 3.1 Framework and Configuration We use USIMM [20], an x86 simulator with a detailed memory system model. We extend USIMM to include a DRAM cache. Table 2 shows the configuration used in our study. We assume a four-level cache hierarchy (L1, L2, L3 being on-chip SRAM caches and L4 being off-chip DRAM cache). All caches use 64B line size. We mode...

[4]

Predicted-Dirty

TICTOC DESIGN DRAM caches need metadata to confirm if a line is cache resident or not (tag bits), and if the resident line is the most up-to-date copy (dirty bit). Tag-Inside-Cacheline (TIC) organizations are optimized for hits as one access gets both metadata and data, but can suffer for misses as misses still need to access DRAM for metadata. In contra...

[5]

no DRAM cache

REDUCING INSTALL BANDWIDTH WITH WRITE-AWARE BYPASS When data has poor reuse, installing lines and updating TOC metadata wastes bandwidth. In fact, in such cases, employing a DRAM cache could actually hurt performance, as the line install and tag maintenance operations needlessly steal bus bandwidth from memory accesses. Figure 13 shows the performance...

[6]

Due to space constraints, we limit results to TicToc with dirty-bit optimizations

RESULTS AND DISCUSSION In this section we present sensitivity studies and storage analysis. Due to space constraints, we limit results to TicToc with dirty-bit optimizations. 6.1 Storage Requirements We analyze the SRAM storage requirements of our TicToc organization. TicToc requires structures from its component TIC and TOC organizations. Inheriting from...

[7]

TIC designs [7, 11, 18, 30, 31, 32] organize their cache as direct-mapped and store tag inside the cacheline, such that one access can retrieve both tag and data

RELATED WORK 7.1 Line-based DRAM Caches In our work, we utilize and combine the two major types of line-granularity DRAM cache designs: Tag-Inside-Cacheline (TIC) and Tag-Outside-Cacheline (TOC) approaches. TIC designs [7, 11, 18, 30, 31, 32] organize their cache as direct-mapped and store tag inside the cacheline, such that one access can retrieve both t...

[8]

CONCLUSION This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. Effective DRAM caching in front of 3D-XPoint is critical to enabling a memory system that has the apparent high capacity of 3D-XPoint, and the low latency and high write bandwidth of DRAM. There are currently two major approaches for DRAM cache desig...

[9]

A revolutionary breakthrough in memory technology,

Intel and Micron, “A revolutionary breakthrough in memory technology,” 2015

[10]

Basic performance measurements of the intel optane DC persistent memory module,

J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic performance measurements of the intel optane DC persistent memory module,” CoRR, vol. abs/1903.05714, 2019

[11]

Intel© optane™ dc persistent memory operating modes explained,

A. Ilkbahar, “Intel© optane™ dc persistent memory operating modes explained,” 2018. Accessed: 2019-03-20

[12]

Scalable high performance main memory system using phase-change memory technology,

M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system using phase-change memory technology,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, (New York, NY, USA), pp. 24–33, ACM, 2009

[13]

Pdram: A hybrid pram and dram main memory system,

G. Dhiman, R. Ayoub, and T. Rosing, “Pdram: A hybrid pram and dram main memory system,” in 2009 46th ACM/IEEE Design Automation Conference, pp. 664–669, July 2009

[14]

Architectural design for next generation heterogeneous memory systems,

A. Bivens, P. Dube, M. Franceschini, J. Karidis, L. Lastras, and M. Tsao, “Architectural design for next generation heterogeneous memory systems,” in Memory Workshop (IMW), 2010 IEEE International, pp. 1–4, IEEE, 2010

[15]

Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,

M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 235–246, Dec 2012

[16]

Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,

J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,” IEEE Computer Architecture Letters, vol. 11, pp. 61–64, July 2012

[17]

Efficiently enabling conventional block sizes for very large die-stacked dram caches,

G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large die-stacked dram caches,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 454–464, ACM, 2011

[18]

Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache,

D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 404–415, ACM, 2013

[19]

Knights landing: Second-generation intel xeon phi product,

A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016

[20]

Resilient die-stacked dram caches,

J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient die-stacked dram caches,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 416–427, ACM, 2013

[21]

Unison cache: A scalable and effective die-stacked dram cache,

D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A scalable and effective die-stacked dram cache,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 25–37, IEEE, 2014

[22]

High bandwidth memory (hbm) dram,

JEDEC Standard, “High bandwidth memory (hbm) dram,” JESD235, 2013

[23]

JEDEC, DDR4 SPEC (JESD79-4), 2013

[24]

Intel’s crazy-fast 3d xpoint optane memory heads for ddr slots (but with a catch),

ArsTechnica, “Intel’s crazy-fast 3d xpoint optane memory heads for ddr slots (but with a catch),” 2018. Accessed: 2019-01-23

[25]

Cascade lake: Next generation intel xeon scalable processor,

M. Arafa, B. Fahim, S. Kottapalli, A. Kumar, L. P. Looi, S. Mandava, A. Rudoff, I. M. Steiner, B. Valentine, G. Vedaraman, and S. Vora, “Cascade lake: Next generation intel xeon scalable processor,” IEEE Micro, vol. 39, pp. 29–36, March 2019

[26]

Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,

C. Chou, A. Jaleel, and M. K. Qureshi, “Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 198–210, ACM, 2015

[27]

Sector cache design and performance,

J. B. Rothman and A. J. Smith, “Sector cache design and performance,” in Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728), pp. 124–133, Aug 2000

[28]

Usimm: the utah simulated memory module,

N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “Usimm: the utah simulated memory module,” University of Utah, Tech. Rep, 2012

[29]

Fact sheet: New intel architectures and technologies target expanded market opportunities,

Intel, “Fact sheet: New intel architectures and technologies target expanded market opportunities,” 2018. Accessed: 2019-03-20

[30]

Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,

H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,” in Microarchitecture, 2004. MICRO-37 2004. 37th International Symposium on, pp. 81–92, Dec 2004

[31]

Spec cpu2006 benchmark descriptions,

J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, pp. 1–17, Sept. 2006

[32]

The GAP Benchmark Suite

S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP benchmark suite,” CoRR, vol. abs/1508.03619, 2015

[33]

A mostly-clean dram cache for effective hit speculation and self-balancing dispatch,

J. Sim, G. H. Loh, H. Kim, M. O’Connor, and M. Thottethodi, “A mostly-clean dram cache for effective hit speculation and self-balancing dispatch,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, pp. 247–257, IEEE, 2012

[34]

Ship: Signature-based hit predictor for high performance caching,

C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer, “Ship: Signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 430–441, ACM, 2011

[35]

Ship++: Enhancing signature-based hit predictor for improved cache performance,

V. Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “Ship++: Enhancing signature-based hit predictor for improved cache performance,” in The 2nd Cache Replacement Championship (CRC-2 Workshop in ISCA 2017), 2017

[36]

Counter-based cache replacement and bypassing algorithms,

M. Kharbutli and Y. Solihin, “Counter-based cache replacement and bypassing algorithms,” IEEE Trans. Comput., vol. 57, pp. 433–447, Apr. 2008

[37]

A dueling segmented lru replacement algorithm with adaptive bypassing,

H. Gao and C. Wilkerson, “A dueling segmented lru replacement algorithm with adaptive bypassing,” in JWAC 2010-1st JILP Workshop on Computer Architecture Competitions: Cache Replacement Championship, 2010

[38]

Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,

V. Young, C. Chou, A. Jaleel, and M. K. Qureshi, “Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 328–339, June 2018

[39]

Candy: Enabling coherent dram caches for multi-node systems,

C. Chou, A. Jaleel, and M. K. Qureshi, “Candy: Enabling coherent dram caches for multi-node systems,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, Oct 2016

[40]

Dice: Compressing dram caches for bandwidth and capacity,

V. Young, P. J. Nair, and M. K. Qureshi, “Dice: Compressing dram caches for bandwidth and capacity,” in ISCA ’17, (New York, NY, USA), pp. 627–638, ACM, 2017

[41]

Atcache: reducing dram cache latency via a small sram tag cache,

C.-C. Huang and V. Nagarajan, “Atcache: reducing dram cache latency via a small sram tag cache,” in Proceedings of the 23rd international conference on Parallel architectures and compilation, pp. 51–60, ACM, 2014

[42]

Building a low latency, highly associative dram cache with the buffered way predictor,

Z. Wang, D. A. Jiménez, T. Zhang, G. H. Loh, and Y. Xie, “Building a low latency, highly associative dram cache with the buffered way predictor,” in 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 109–117, Oct 2016

