arxiv: 1907.02184 · v1 · pith:QNEINOARnew · submitted 2019-07-04 · 💻 cs.AR

TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and Misses in Hybrid Memory Systems

Pith reviewed 2026-05-25 09:12 UTC · model grok-4.3

classification 💻 cs.AR
keywords DRAM cache · hybrid memory · 3D-XPoint · tag organization · bandwidth reduction · dirty bit · cache metadata

The pith

TicToc combines tag-inside and tag-outside DRAM cache organizations to deliver low hit latency and low miss bandwidth with only 34KB SRAM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a DRAM cache placed in front of slower 3D-XPoint memory can be made to serve both hits and misses efficiently by keeping both per-line tags inside data blocks and grouped tags outside them. Naively merging the two organizations increases bandwidth traffic for metadata updates, so the authors introduce a dirtiness bit sent to the last-level cache and a prediction step that marks lines dirty when they are first installed. With these changes the design yields a 10 percent speedup over a hit-optimized baseline while approaching the 14 percent gain of an idealized cache that would need 64MB of SRAM tags. A reader would care because high-capacity non-volatile memory only becomes practical if its access penalties can be hidden by a small, fast DRAM layer without exhausting memory bandwidth.

Core claim

TicToc provisions both TIC and TOC metadata inside the same DRAM cache. The dominant bandwidth cost comes from repeated dirty-bit checks for the TOC structure; this cost is reduced by carrying a DRAM Cache Dirtiness Bit to the last-level cache so known-dirty lines skip further checks, and by Preemptive Dirty Marking that sets the bit at install time for lines predicted to be written soon. On a 4GB DRAM cache backed by 3D-XPoint these changes produce a 10 percent speedup over baseline TIC while using only 34KB of SRAM.
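
The effect of the two mechanisms on dirty-bit traffic can be illustrated with a toy accounting model. This is a minimal sketch under stated assumptions, not the paper's simulator: the class `ToyTocDirtyTracker`, the helper `dirty_traffic`, the one-access cost for each dirty-bit read or write, and the assumption that setting the bit at install time is free (because the install already writes TOC metadata) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTocDirtyTracker:
    """Toy accounting of TOC dirty-bit traffic; not the paper's simulator."""
    use_dcdb: bool = False          # carry a dirtiness hint to the LLC
    use_pdm: bool = False           # mark predicted-written lines dirty at install
    toc_dirty: dict = field(default_factory=dict)   # line address -> TOC dirty bit
    dirty_bit_accesses: int = 0     # extra DRAM accesses spent on TOC dirty bits

    def install(self, addr: int, predicted_written: bool = False) -> None:
        # Assumption: the install already writes TOC metadata, so setting the
        # dirty bit here costs no additional access.
        self.toc_dirty[addr] = self.use_pdm and predicted_written

    def fill_llc(self, addr: int) -> bool:
        # Returns the hint that travels with the line into the LLC.
        return self.use_dcdb and self.toc_dirty[addr]

    def writeback(self, addr: int, known_dirty: bool) -> None:
        if known_dirty:
            return                        # hint says dirty: skip the TOC check
        self.dirty_bit_accesses += 1      # read the TOC dirty bit
        if not self.toc_dirty[addr]:
            self.toc_dirty[addr] = True
            self.dirty_bit_accesses += 1  # write the updated dirty bit


def dirty_traffic(use_dcdb: bool, use_pdm: bool, writes: int = 3) -> int:
    """Count dirty-bit accesses for one line that is written back `writes` times."""
    cache = ToyTocDirtyTracker(use_dcdb=use_dcdb, use_pdm=use_pdm)
    cache.install(0x40, predicted_written=True)
    for _ in range(writes):
        hint = cache.fill_llc(0x40)
        cache.writeback(0x40, known_dirty=hint)
    return cache.dirty_bit_accesses
```

In this toy model the dirtiness hint removes the repeated checks after the first writeback, and marking at install removes the initial update as well, which mirrors the "prune" and "amortize" roles described above. The absolute counts depend entirely on the assumed costs.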

What carries the argument

DRAM Cache Dirtiness Bit propagated to the last-level cache together with Preemptive Dirty Marking at install time; these two mechanisms prune and amortize the dominant TOC dirty-bit traffic that otherwise negates the benefit of the combined organization.

If this is right

  • A 4GB DRAM cache can reach within four percentage points of the performance of an idealized cache that stores tags in 64MB of SRAM.
  • The entire metadata scheme fits in 34KB of SRAM.
  • Traffic on the memory channel shared by the DRAM cache and the backing 3D-XPoint memory decreases because fewer tag and dirty-bit accesses are required.
  • The same DRAM cache now optimizes for both hits and misses instead of trading one for the other.
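
The headline numbers can be put side by side with a few lines of arithmetic. This is only a back-of-the-envelope sketch using the figures quoted in the abstract; the variable names are illustrative, and the conversion assumes 1MB = 1024KB.

```python
tictoc_speedup_pct = 10.0      # over baseline TIC, from the abstract
ideal_speedup_pct = 14.0       # idealized cache with SRAM tags, from the abstract
tictoc_sram_kb = 34
ideal_sram_kb = 64 * 1024      # 64MB, assuming 1MB = 1024KB

gap_points = ideal_speedup_pct - tictoc_speedup_pct
fraction_of_ideal_gain = tictoc_speedup_pct / ideal_speedup_pct
sram_reduction = ideal_sram_kb / tictoc_sram_kb

print(f"gap to ideal: {gap_points:.0f} percentage points")
print(f"share of ideal gain captured: {fraction_of_ideal_gain:.0%}")
print(f"SRAM reduction: about {sram_reduction:.0f}x")
```

Under these assumptions TicToc captures roughly 71 percent of the idealized gain while using on the order of 1,900 times less SRAM.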

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dirtiness-bit propagation idea could be applied to other cache metadata that must stay consistent across hierarchy levels.
  • Prediction accuracy for preemptive marking may improve if it incorporates program-counter history rather than a simple heuristic.
  • The approach may extend to other non-volatile memories that also exhibit read/write asymmetry and high access latency.
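
To make the program-counter idea concrete, here is a minimal sketch of a generic table of saturating counters indexed by a hashed PC. It is not the predictor described in the paper: the class name `SignatureWritePredictor`, the table size, the counter width, the threshold, and the hash function are all assumptions chosen for illustration.

```python
class SignatureWritePredictor:
    """Generic PC-indexed saturating-counter predictor (illustrative only)."""

    def __init__(self, entries: int = 256, counter_max: int = 3, threshold: int = 2):
        self.entries = entries
        self.counter_max = counter_max
        self.threshold = threshold
        self.table = [0] * entries          # all counters start at "not written"

    def _index(self, pc: int) -> int:
        # Hypothetical hash: fold the upper PC bits into the lower ones.
        return (pc ^ (pc >> 8)) % self.entries

    def predict_write(self, pc: int) -> bool:
        """Predict whether a line installed by this PC will be written."""
        return self.table[self._index(pc)] >= self.threshold

    def train(self, pc: int, was_written: bool) -> None:
        """Update the counter once the line's actual behaviour is known."""
        i = self._index(pc)
        if was_written:
            self.table[i] = min(self.counter_max, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


predictor = SignatureWritePredictor()
pc = 0x400123
before = predictor.predict_write(pc)      # untrained entry
predictor.train(pc, was_written=True)
predictor.train(pc, was_written=True)
after_two_writes = predictor.predict_write(pc)
predictor.train(pc, was_written=False)
after_one_clean = predictor.predict_write(pc)
```

A small table like this would have to fit within whatever share of the SRAM budget is left for prediction, which is why the entry count and counter width are the main design knobs in such a sketch.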

Load-bearing premise

The bandwidth saved by avoiding repeated dirty-bit traffic and initial updates will exceed any new overhead introduced by the extra bit and the prediction logic.
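
The premise can be stated as a simple break-even condition. The sketch below is a simplified model, not an analysis from the paper: it assumes each correct prediction saves a fixed amount of bandwidth, each wrong prediction costs a fixed amount, and both quantities (`saved_per_correct`, `cost_per_wrong`) are hypothetical parameters in arbitrary units.

```python
def net_saving(accuracy: float, saved_per_correct: float,
               cost_per_wrong: float) -> float:
    """Expected bandwidth saved per prediction under the toy model."""
    return accuracy * saved_per_correct - (1.0 - accuracy) * cost_per_wrong


def break_even_accuracy(saved_per_correct: float, cost_per_wrong: float) -> float:
    """Accuracy above which the expected net saving becomes positive."""
    return cost_per_wrong / (saved_per_correct + cost_per_wrong)


# Example with made-up costs: a wrong prediction is three times as expensive
# as the saving from a correct one.
threshold = break_even_accuracy(saved_per_correct=1.0, cost_per_wrong=3.0)
at_threshold = net_saving(threshold, 1.0, 3.0)
above_threshold = net_saving(0.9, 1.0, 3.0)
```

The more expensive a misprediction is relative to the saving, the higher the accuracy the predictor needs before the premise holds.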

What would settle it

Run the paper's workloads on a simulator with the dirtiness bit and preemptive marking disabled and then enabled, and measure whether total DRAM cache bandwidth falls and whether the overall speedup matches the reported 10 percent gain.

Figures

Figures reproduced from arXiv: 1907.02184 by Moinuddin K. Qureshi, Vinson Young, Zeshan Chishti.

Figure 1: (a) Channel-Sharing Hybrid Memory, and (b) Performance of hit-optimized Tag-Inside-Cacheline (TIC) [7],
Figure 2: DRAM cache organization and flow for (a) idealized Tag-In-SRAM, (b) hit-latency-optimized Tag-Inside
Figure 3: TicToc Metadata Organization queries hit/miss predictor to use TIC metadata for hits and TOC metadata
Figure 4: Breakdown of bus bandwidth consumption for
Figure 5: Breakdown of bus bandwidth consumption for
Figure 6: Bandwidth for a typical (a) write path and (b) miss+install path. TicToc+PDM adds “Predicted-Dirty” state,
Figure 7: Speedup of TOC, proposed TicToc, TicToc with DRAM Cache Dirtiness bit, TicToc with Preemptive Dirty
Figure 8: Signature-based Write Predictor learns which
Figure 10: Breakdown of bus bandwidth for dirty-optimized TicToc. Dirty-bit updates are greatly reduced.
Figure 11: Breakdown of bus bandwidth for dirty-optimized TicToc w/ Write-Aware Bypassing. Installs are mitigated.
Figure 12: Write-Aware Bypass. Reduce install band
Figure 13: Speedup of a no-DRAM-cache configuration, proposed TicToc organization, adding 90%-bypass, adding
Figure 14: Speedup of Channel-Shared Hybrid Memory over Dedicated-Channel Hybrid Memory. Channel-sharing enables up to 40% speedup. [Section 6.4, Multi-programmed Workloads: To show robustness of our proposal to multi-programmed workloads, we conduct evaluations over a larger set of 17 mix-application workloads]
Figure 15: Speedup of TicToc with dirty-bit optimiza
Figure 16: Speedup of TicToc (dirty-opt, bypassing) and

Original abstract

This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is becoming a viable alternative to DRAM as it enables high-capacity and non-volatile main memory systems; however, 3D-XPoint has 4-8x slower read, and worse writes. As such, effective DRAM caching in front of 3D-XPoint is important to enable a high-capacity, low-latency, and high-write-bandwidth memory. There are two major approaches for DRAM cache design: (1) a Tag-Inside-Cacheline (TIC) organization that optimizes for hits, by storing tag next to each line such that one access gets both tag and data, and (2) a Tag-Outside-Cacheline (TOC) organization that optimizes for misses, by storing tags from multiple data-lines together such that one tag-access gets info for several data-lines. Ideally, we desire the low hit-latency of TIC, and the low miss-bandwidth of TOC. To this end, we propose TicToc, an organization that provisions both TIC and TOC to get hit and miss benefits of both. However, we find that naively combining both actually performs worse than TIC, because one needs to pay bandwidth to maintain both metadata. The main contribution of this work is developing architectural techniques to reduce the bandwidth of maintaining both TIC and TOC metadata. We find the majority of the bandwidth cost is due to maintaining TOC dirty bits. We propose DRAM Cache Dirtiness Bit, which carries DRAM cache dirty info to last-level caches, to prune repeated dirty-bit checks for known dirty lines. We then propose Preemptive Dirty Marking, which predicts which lines will be written and proactively marks dirty bit at install time, to amortize the initial dirty-bit update. Our evaluations on a 4GB DRAM cache with 3D-XPoint memory show that TicToc enables 10% speedup over baseline TIC, nearing 14% speedup possible with an idealized DRAM cache w/ 64MB of SRAM tags, while needing only 34KB SRAM.
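
The hit/miss trade-off between the two organizations can be sketched with a toy cost model. This is an illustration under stated assumptions rather than the paper's methodology: the function name, the `lines_per_tag_access` parameter, the perfect hit/miss predictor for the combined case, and the decision to ignore all metadata maintenance traffic are hypothetical simplifications.

```python
def dram_accesses_per_request(hit_rate: float, organization: str,
                              lines_per_tag_access: int = 8) -> float:
    """Expected DRAM-cache accesses per request under a toy cost model."""
    miss_rate = 1.0 - hit_rate
    # One TOC tag access covers several lines, so its cost is amortized.
    amortized_tag = 1.0 / lines_per_tag_access
    if organization == "TIC":
        # Hit: one access returns tag and data. Miss: one access to discover it.
        return hit_rate * 1.0 + miss_rate * 1.0
    if organization == "TOC":
        # Hit: amortized tag access plus a data access. Miss: tag access only.
        return hit_rate * (amortized_tag + 1.0) + miss_rate * amortized_tag
    if organization == "TIC+TOC":
        # Idealized combination: TIC path on hits, TOC path on misses,
        # assuming a perfect hit/miss predictor and free metadata upkeep.
        return hit_rate * 1.0 + miss_rate * amortized_tag
    raise ValueError(f"unknown organization: {organization}")


tic = dram_accesses_per_request(0.5, "TIC")
toc = dram_accesses_per_request(0.5, "TOC")
combined = dram_accesses_per_request(0.5, "TIC+TOC")
```

The model deliberately leaves out the cost of keeping both sets of metadata up to date, which is exactly the overhead the abstract identifies as making a naive combination worse than TIC alone.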

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TicToc, a DRAM cache design for hybrid DRAM + 3D-XPoint systems that combines Tag-Inside-Cacheline (TIC) and Tag-Outside-Cacheline (TOC) organizations to achieve both low hit latency and low miss bandwidth. It shows that a naive combination performs worse than TIC alone due to extra metadata maintenance bandwidth, primarily from TOC dirty bits. The main contributions are two techniques—DRAM Cache Dirtiness Bit (which propagates dirty state to the LLC) and Preemptive Dirty Marking (a predictor that marks lines dirty at install time)—to reduce this cost. On a 4GB DRAM cache, TicToc delivers 10% speedup over baseline TIC (approaching the 14% of an idealized 64MB-SRAM-tag cache) while using only 34KB SRAM.

Significance. If the net bandwidth savings from the proposed techniques are confirmed to outweigh their added LLC and predictor overheads, the result would be a practical advance in hybrid memory caching: it closes most of the gap to an idealized tag store without requiring large on-chip SRAM. The work directly addresses a well-known TIC/TOC trade-off and supplies concrete, low-overhead mechanisms that could be adopted in future non-volatile memory controllers.

major comments (2)
  1. [evaluation / techniques section] §4 (or the evaluation section describing the two techniques): the central 10% speedup claim rests on the assertion that DRAM Cache Dirtiness Bit plus Preemptive Dirty Marking produce net bandwidth savings that exceed the extra TOC dirty-bit traffic plus any new LLC bandwidth or predictor-misprediction writes. No table or figure quantifies the incremental LLC-to-memory traffic or misprediction-induced writes against the reported savings; without this breakdown the headline delta cannot be verified.
  2. [results / idealized baseline] Table or figure reporting the 10% and 14% speedups: the comparison to the idealized DRAM cache assumes 64 MB SRAM tags, yet the paper does not state whether the idealized model also includes the same LLC and memory-controller constraints that TicToc must satisfy; this makes the proximity claim difficult to interpret.
minor comments (2)
  1. [abstract] Abstract states performance numbers but supplies no workload names, simulation parameters, or error bars; the full manuscript should make these explicit in the evaluation section.
  2. [techniques description] The description of Preemptive Dirty Marking does not specify the predictor structure or training method; a short paragraph or diagram would clarify the 34 KB SRAM budget allocation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and verifiability of the results.

Point-by-point responses
  1. Referee: [evaluation / techniques section] §4 (or the evaluation section describing the two techniques): the central 10% speedup claim rests on the assertion that DRAM Cache Dirtiness Bit plus Preemptive Dirty Marking produce net bandwidth savings that exceed the extra TOC dirty-bit traffic plus any new LLC bandwidth or predictor-misprediction writes. No table or figure quantifies the incremental LLC-to-memory traffic or misprediction-induced writes against the reported savings; without this breakdown the headline delta cannot be verified.

    Authors: We agree that an explicit breakdown would strengthen the paper. The current results demonstrate the net performance benefit through end-to-end simulation, but do not isolate the incremental LLC-to-DRAM and misprediction traffic components. In revision we will add a new figure (or table) in Section 4 that reports these incremental bandwidth costs for each technique relative to the baseline TIC organization, allowing direct verification that the savings exceed the added overheads. revision: yes

  2. Referee: [results / idealized baseline] Table or figure reporting the 10% and 14% speedups: the comparison to the idealized DRAM cache assumes 64 MB SRAM tags, yet the paper does not state whether the idealized model also includes the same LLC and memory-controller constraints that TicToc must satisfy; this makes the proximity claim difficult to interpret.

    Authors: The idealized DRAM cache (64MB SRAM tags) is modeled under the same system constraints as TicToc, including the same LLC size, replacement policy, memory controller, and 3D-XPoint timing parameters; only the tag store is idealized, with all tags held in SRAM. We will revise the results text and the figure caption to state these modeling assumptions explicitly so that the 14% figure is directly comparable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; simulation-based architectural proposal

full rationale

The paper proposes TicToc as an architectural organization for DRAM caches and evaluates performance via simulation on a 4GB DRAM cache setup. No equations, fitted parameters, or derivation chains exist that could reduce claims to inputs by construction. Central results (10% speedup) are reported from direct simulation comparisons to TIC baseline and idealized cases, not from self-definitional metadata or self-citation load-bearing steps. This matches the default expectation of no circularity for non-derivational work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or fitted constants appear in the abstract; the contribution is an engineering design rather than a derivation resting on free parameters or new axioms.

pith-pipeline@v0.9.0 · 5931 in / 1161 out tokens · 32898 ms · 2026-05-25T09:12:34.694095+00:00 · methodology

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 2 internal anchors

  1. [1]

    TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and Misses in Hybrid Memory Systems

    INTRODUCTION As memory systems scale, non-volatile memories or NVMs (such as, 3D-XPoint [1]) are emerging as viable alternatives to DRAM. NVMs offer the advantages of higher bit density and the ability to retain data after power outages. However, NVMs also have significant limitations that prevent them from outright replacing DRAM in the memory hierarchy. ...

  2. [2]

    A DRAM cache design has to balance multiple goals

    BACKGROUND AND MOTIVATION DRAM caches are important for enabling heterogeneous memory systems to have the effective latency and bandwidth of one memory technology, and the capacity of another; however, there are several challenges in designing DRAM caches. A DRAM cache design has to balance multiple goals. First, it should minimize the SRAM storage nee...

  3. [3]

    We extend USIMM to include a DRAM cache

    METHODOLOGY 3.1 Framework and Configuration We use USIMM [20], an x86 simulator with detailed memory system model. We extend USIMM to include a DRAM cache. Table 2 shows the configuration used in our study. We assume a four-level cache hierarchy (L1, L2, L3 being on-chip SRAM caches and L4 being off-chip DRAM cache). All caches use 64B line size. We mode...

  4. [4]

    Predicted-Dirty

    TICTOC DESIGN DRAM caches need metadata to confirm if a line is cache resident or not (tag bits), and if the resident line is the most up-to-date copy (dirty bit). Tag-Inside-Cacheline (TIC) organizations are optimized for hits as one access gets both metadata and data, but can suffer for misses as misses still need to access DRAM for metadata. In contra...

  5. [5]

    no DRAM cache

    REDUCING INSTALL BANDWIDTH WITH WRITE-AWARE BYPASS When data has poor reuse, installing lines and updating TOC metadata wastes bandwidth. In fact, in such cases, employing a DRAM cache could actually hurt performance, as the line install and tag maintenance operations needlessly steal bus bandwidth from memory accesses. Figure 13 shows the performance...

  6. [6]

    Due to space constraints, we limit results to TicToc with dirty-bit optimizations

    RESULTS AND DISCUSSION In this section we present sensitivity studies and storage analysis. Due to space constraints, we limit results to TicToc with dirty-bit optimizations. 6.1 Storage Requirements We analyze the SRAM storage requirements of our TicToc organization. TicToc requires structures from its component TIC and TOC organizations. Inheriting from...

  7. [7]

    TIC designs [7, 11, 18, 30, 31, 32] organize their cache as direct-mapped and store tag inside the cacheline, such that one access can retrieve both tag and data

    RELATED WORK 7.1 Line-based DRAM Caches In our work, we utilize and combine the two major types of line-granularity DRAM cache designs: Tag-Inside-Cacheline (TIC) and Tag-Outside-Cacheline (TOC) approaches. TIC designs [7, 11, 18, 30, 31, 32] organize their cache as direct-mapped and store tag inside the cacheline, such that one access can retrieve both t...

  8. [8]

    CONCLUSION This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. Effective DRAM caching in front of 3D-XPoint is critical to enabling a memory system that has the apparent high-capacity of 3D-XPoint, and the low-latency and high-write-bandwidth of DRAM. There are two currently major approaches for DRAM cache desig...

  9. [9]

    A revolutionary breakthrough in memory technology,

    Intel and Micron, “A revolutionary breakthrough in memory technology,” 2015

  10. [10]

    Basic performance measurements of the intel optane DC persistent memory module,

    J. Izraelevitz, J. Yang, L. Zhang, J. Kim, X. Liu, A. Memaripour, Y. J. Soh, Z. Wang, Y. Xu, S. R. Dulloor, J. Zhao, and S. Swanson, “Basic performance measurements of the intel optane DC persistent memory module,” CoRR, vol. abs/1903.05714, 2019

  11. [11]

    Intel© optane™ dc persistent memory operating modes explained,

    A. Ilkbahar, “Intel© optane™ dc persistent memory operating modes explained,” 2018. Accessed: 2019-03-20

  12. [12]

    Scalable high performance main memory system using phase-change memory technology,

    M. K. Qureshi, V. Srinivasan, and J. A. Rivers, “Scalable high performance main memory system using phase-change memory technology,” in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA ’09, (New York, NY, USA), pp. 24–33, ACM, 2009

  13. [13]

    Pdram: A hybrid pram and dram main memory system,

    G. Dhiman, R. Ayoub, and T. Rosing, “Pdram: A hybrid pram and dram main memory system,” in 2009 46th ACM/IEEE Design Automation Conference, pp. 664–669, July 2009

  14. [14]

    Architectural design for next generation heterogeneous memory systems,

    A. Bivens, P. Dube, M. Franceschini, J. Karidis, L. Lastras, and M. Tsao, “Architectural design for next generation heterogeneous memory systems,” in Memory Workshop (IMW), 2010 IEEE International, pp. 1–4, IEEE, 2010

  15. [15]

    Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,

    M. K. Qureshi and G. H. Loh, “Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and practical design,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 235–246, Dec 2012

  16. [16]

    Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,

    J. Meza, J. Chang, H. Yoon, O. Mutlu, and P. Ranganathan, “Enabling efficient and scalable hybrid memories using fine-granularity dram cache management,” IEEE Computer Architecture Letters, vol. 11, pp. 61–64, July 2012

  17. [17]

    Efficiently enabling conventional block sizes for very large die-stacked dram caches,

    G. H. Loh and M. D. Hill, “Efficiently enabling conventional block sizes for very large die-stacked dram caches,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 454–464, ACM, 2011

  18. [18]

    Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache,

    D. Jevdjic, S. Volos, and B. Falsafi, “Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 404–415, ACM, 2013

  19. [19]

    Knights landing: Second-generation intel xeon phi product,

    A. Sodani, R. Gramunt, J. Corbal, H.-S. Kim, K. Vinod, S. Chinthamani, S. Hutsell, R. Agarwal, and Y.-C. Liu, “Knights landing: Second-generation intel xeon phi product,” IEEE Micro, vol. 36, pp. 34–46, Mar 2016

  20. [20]

    Resilient die-stacked dram caches,

    J. Sim, G. H. Loh, V. Sridharan, and M. O’Connor, “Resilient die-stacked dram caches,” in Proceedings of the 40th Annual International Symposium on Computer Architecture, ISCA ’13, (New York, NY, USA), pp. 416–427, ACM, 2013

  21. [21]

    Unison cache: A scalable and effective die-stacked dram cache,

    D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi, “Unison cache: A scalable and effective die-stacked dram cache,” in Microarchitecture (MICRO), 2014 47th Annual IEEE/ACM International Symposium on, pp. 25–37, IEEE, 2014

  22. [22]

    High bandwidth memory (hbm) dram,

    JEDEC Standard, “High bandwidth memory (hbm) dram,” JESD235, 2013

  23. [23]

    JEDEC, DDR4 SPEC (JESD79-4), 2013

  24. [24]

    Intel’s crazy-fast 3d xpoint optane memory heads for ddr slots (but with a catch),

    ArsTechnica, “Intel’s crazy-fast 3d xpoint optane memory heads for ddr slots (but with a catch),” 2018. Accessed: 2019-01-23

  25. [25]

    Cascade lake: Next generation intel xeon scalable processor,

    M. Arafa, B. Fahim, S. Kottapalli, A. Kumar, L. P. Looi, S. Mandava, A. Rudoff, I. M. Steiner, B. Valentine, G. Vedaraman, and S. Vora, “Cascade lake: Next generation intel xeon scalable processor,” IEEE Micro, vol. 39, pp. 29–36, March 2019

  26. [26]

    Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,

    C. Chou, A. Jaleel, and M. K. Qureshi, “Bear: Techniques for mitigating bandwidth bloat in gigascale dram caches,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 198–210, ACM, 2015

  27. [27]

    Sector cache design and performance,

    J. B. Rothman and A. J. Smith, “Sector cache design and performance,” in Proceedings 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (Cat. No.PR00728), pp. 124–133, Aug 2000

  28. [28]

    Usimm: the utah simulated memory module,

    N. Chatterjee, R. Balasubramonian, M. Shevgoor, S. Pugsley, A. Udipi, A. Shafiee, K. Sudan, M. Awasthi, and Z. Chishti, “Usimm: the utah simulated memory module,” University of Utah, Tech. Rep, 2012

  29. [29]

    Fact sheet: New intel architectures and technologies target expanded market opportunities,

    Intel, “Fact sheet: New intel architectures and technologies target expanded market opportunities,” 2018. Accessed: 2019-03-20

  30. [30]

    Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,

    H. Patil, R. Cohn, M. Charney, R. Kapoor, A. Sun, and A. Karunanidhi, “Pinpointing representative portions of large intel itanium programs with dynamic instrumentation,” in Microarchitecture, 2004. MICRO-37 2004. 37th International Symposium on, pp. 81–92, Dec 2004

  31. [31]

    Spec cpu2006 benchmark descriptions,

    J. L. Henning, “Spec cpu2006 benchmark descriptions,” SIGARCH Comput. Archit. News, vol. 34, pp. 1–17, Sept. 2006

  32. [32]

    The GAP Benchmark Suite

    S. Beamer, K. Asanovic, and D. A. Patterson, “The GAP benchmark suite,” CoRR, vol. abs/1508.03619, 2015

  33. [33]

    A mostly-clean dram cache for effective hit speculation and self-balancing dispatch,

    J. Sim, G. H. Loh, H. Kim, M. O’Connor, and M. Thottethodi, “A mostly-clean dram cache for effective hit speculation and self-balancing dispatch,” in Microarchitecture (MICRO), 2012 45th Annual IEEE/ACM International Symposium on, pp. 247–257, IEEE, 2012

  34. [34]

    Ship: Signature-based hit predictor for high performance caching,

    C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, Jr., and J. Emer, “Ship: Signature-based hit predictor for high performance caching,” in Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-44, (New York, NY, USA), pp. 430–441, ACM, 2011

  35. [35]

    Ship++: Enhancing signature-based hit predictor for improved cache performance,

    V. Young, C.-C. Chou, A. Jaleel, and M. Qureshi, “Ship++: Enhancing signature-based hit predictor for improved cache performance,” in The 2nd Cache Replacement Championship (CRC-2 Workshop in ISCA 2017), 2017

  36. [36]

    Counter-based cache replacement and bypassing algorithms,

    M. Kharbutli and Y. Solihin, “Counter-based cache replacement and bypassing algorithms,” IEEE Trans. Comput., vol. 57, pp. 433–447, Apr. 2008

  37. [37]

    A dueling segmented lru replacement algorithm with adaptive bypassing,

    H. Gao and C. Wilkerson, “A dueling segmented lru replacement algorithm with adaptive bypassing,” in JWAC 2010-1st JILP Workshop on Computer Architecture Competitions: cache replacement Championship, 2010

  38. [38]

    Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,

    V. Young, C. Chou, A. Jaleel, and M. K. Qureshi, “Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction,” in 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pp. 328–339, June 2018

  39. [39]

    Candy: Enabling coherent dram caches for multi-node systems,

    C. Chou, A. Jaleel, and M. K. Qureshi, “Candy: Enabling coherent dram caches for multi-node systems,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–13, Oct 2016

  40. [40]

    Dice: Compressing dram caches for bandwidth and capacity,

    V. Young, P. J. Nair, and M. K. Qureshi, “Dice: Compressing dram caches for bandwidth and capacity,” in ISCA ’17, (New York, NY, USA), pp. 627–638, ACM, 2017

  41. [41]

    Atcache: reducing dram cache latency via a small sram tag cache,

    C.-C. Huang and V. Nagarajan, “Atcache: reducing dram cache latency via a small sram tag cache,” in Proceedings of the 23rd international conference on Parallel architectures and compilation, pp. 51–60, ACM, 2014

  42. [42]

    Building a low latency, highly associative dram cache with the buffered way predictor,

    Z. Wang, D. A. Jiménez, T. Zhang, G. H. Loh, and Y. Xie, “Building a low latency, highly associative dram cache with the buffered way predictor,” in 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp. 109–117, Oct 2016

  43. [43]

    A fully associative, tagless dram cache,

    Y. Lee, J. Kim, H. Jang, H. Yang, J. Kim, J. Jeong, and J. W. Lee, “A fully associative, tagless dram cache,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture, ISCA ’15, (New York, NY, USA), pp. 211–222, ACM, 2015

  44. [44]

    Efficient footprint caching for tagless dram caches,

    H. Jang, Y. Lee, J. Kim, Y. Kim, J. Kim, J. Jeong, and J. W. Lee, “Efficient footprint caching for tagless dram caches,” in High Performance Computer Architecture (HPCA), 2016 IEEE International Symposium on, pp. 237–248, IEEE, 2016

  45. [45]

    Challenges in heterogeneous die-stacked and off-chip memory systems,

    G. H. Loh, N. Jayasena, J. Chung, S. K. Reinhardt, M. O’Connor, and K. McGrath, “Challenges in heterogeneous die-stacked and off-chip memory systems,” in 3rd Workshop on SoCs, Heterogeneous Architectures and Workloads (SHAW-3), Feb 2012

  46. [46]

    Banshee: Bandwidth-efficient dram caching via software/hardware cooperation,

    X. Yu, C. J. Hughes, N. Satish, O. Mutlu, and S. Devadas, “Banshee: Bandwidth-efficient dram caching via software/hardware cooperation,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-50 ’17, (New York, NY, USA), pp. 1–14, ACM, 2017

  47. [47]

    Cameo: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache,

    C. Chou, A. Jaleel, and M. K. Qureshi, “Cameo: A two-level memory organization with capacity of main memory and flexibility of hardware-managed cache,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, (Washington, DC, USA), pp. 1–12, IEEE Computer Society, 2014

  48. [48]

    Transparent hardware management of stacked dram as part of memory,

    J. Sim, A. R. Alameldeen, Z. Chishti, C. Wilkerson, and H. Kim, “Transparent hardware management of stacked dram as part of memory,” in Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO-47, (Washington, DC, USA), pp. 13–24, IEEE Computer Society, 2014

  49. [49]

    Silc-fm: Subblocked interleaved cache-like flat memory organization,

    J. H. Ryoo, M. R. Meswani, A. Prodromou, and L. K. John, “Silc-fm: Subblocked interleaved cache-like flat memory organization,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 349–360, Feb 2017

  50. [50]

    Mempod: A clustered architecture for efficient and scalable migration in flat address space multi-level memories,

    A. Prodromou, M. Meswani, N. Jayasena, G. Loh, and D. M. Tullsen, “Mempod: A clustered architecture for efficient and scalable migration in flat address space multi-level memories,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 433–444, Feb 2017

  51. [51]

    Pageseer: Using page walks to trigger page swaps in hybrid memory systems,

    A. Kokolis, “Pageseer: Using page walks to trigger page swaps in hybrid memory systems,” 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 596–608, 2019

  52. [52]

    C3d: Mitigating the numa bottleneck via coherent dram caches,

    C. Huang, R. Kumar, M. Elver, B. Grot, and V. Nagarajan, “C3d: Mitigating the numa bottleneck via coherent dram caches,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1–12, Oct 2016

  53. [53]

    Cache coherence for GPU architectures,

    I. Singh, A. Shriraman, W. W. L. Fung, M. O’Connor, and T. M. Aamodt, “Cache coherence for GPU architectures,” in 19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013, Shenzhen, China, February 23-27, 2013

  54. [54]

    Combining hw/sw mechanisms to improve numa performance of multi-gpu systems,

    V. Young, A. Jaleel, E. Bolotin, E. Ebrahimi, D. Nellans, and O. Villa, “Combining hw/sw mechanisms to improve numa performance of multi-gpu systems,” in MICRO ’18, October 2018.