TicToc: Enabling Bandwidth-Efficient DRAM Caching for both Hits and Misses in Hybrid Memory Systems
Pith reviewed 2026-05-25 09:12 UTC · model grok-4.3
The pith
TicToc combines tag-inside and tag-outside DRAM cache organizations to deliver low hit latency and low miss bandwidth with only 34KB SRAM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TicToc provisions both TIC and TOC metadata inside the same DRAM cache. The dominant bandwidth cost comes from repeated dirty-bit checks for the TOC structure; this cost is reduced by carrying a DRAM Cache Dirtiness Bit to the last-level cache so known-dirty lines skip further checks, and by Preemptive Dirty Marking that sets the bit at install time for lines predicted to be written soon. On a 4GB DRAM cache backed by 3D-XPoint these changes produce a 10 percent speedup over baseline TIC while using only 34KB of SRAM.
What carries the argument
DRAM Cache Dirtiness Bit propagated to the last-level cache together with Preemptive Dirty Marking at install time; these two mechanisms prune and amortize the dominant TOC dirty-bit traffic that otherwise negates the benefit of the combined organization.
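One way to picture the pruning effect is a toy model. This is a sketch under assumptions that are not taken from the paper: every writeback that cannot prove the line is already dirty costs exactly one TOC dirty-bit access, and the names DramCache, fill_llc, writeback, and count_toc_dirty_accesses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DramCache:
    """Toy DRAM cache that tracks only per-line dirty state and TOC dirty-bit traffic."""
    dirty: set = field(default_factory=set)
    toc_dirty_accesses: int = 0  # accesses spent checking or updating TOC dirty bits

    def fill_llc(self, addr):
        """Send a line to the LLC and return the dirtiness bit that travels with it."""
        return addr in self.dirty

    def writeback(self, addr, dcd_bit, use_dcd):
        """Handle an LLC writeback of addr into the DRAM cache."""
        if use_dcd and dcd_bit:
            return  # line is already known dirty, so the TOC check is skipped
        # Assumption: one access to consult (and set if needed) the TOC dirty bit.
        self.toc_dirty_accesses += 1
        self.dirty.add(addr)


def count_toc_dirty_accesses(trace, use_dcd):
    """Replay a trace of addresses that are each filled and later written back."""
    cache = DramCache()
    for addr in trace:
        dcd_bit = cache.fill_llc(addr)
        cache.writeback(addr, dcd_bit, use_dcd)
    return cache.toc_dirty_accesses


trace = [0x40, 0x80, 0x40, 0x40, 0x80]
print(count_toc_dirty_accesses(trace, use_dcd=False))
print(count_toc_dirty_accesses(trace, use_dcd=True))
```

In this toy trace only the first writeback of each line pays the dirty-bit cost once the bit is carried to the LLC. The real saving depends on how often lines are rewritten, which the model does not attempt to capture.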
If this is right
- A 4GB DRAM cache can reach within four percentage points of the performance of an idealized cache that stores tags in 64MB of SRAM.
- The entire metadata scheme fits in 34KB of SRAM.
- Metadata traffic inside the DRAM cache decreases because fewer TOC dirty-bit checks and updates are required.
- The same DRAM cache now optimizes for both hits and misses instead of trading one for the other.
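The headline figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in the abstract (10% and 14% speedup, 64MB and 34KB of SRAM). The function name headline_numbers is hypothetical, and the conversion assumes 1 MB = 1024 KB.

```python
def headline_numbers(tictoc_pct=10, ideal_pct=14, ideal_sram_mb=64, tictoc_sram_kb=34):
    """Derive the gap to the idealized cache and the SRAM ratio from the quoted figures."""
    gap_points = ideal_pct - tictoc_pct                  # percentage points left on the table
    fraction_of_ideal = tictoc_pct / ideal_pct           # share of the idealized gain achieved
    sram_ratio = ideal_sram_mb * 1024 / tictoc_sram_kb   # assumes 1 MB = 1024 KB
    return gap_points, fraction_of_ideal, sram_ratio


gap, fraction, ratio = headline_numbers()
print(f"gap: {gap} points, fraction of ideal gain: {fraction:.0%}, SRAM ratio: {ratio:.0f}x")
```

Under these figures TicToc captures roughly 71 percent of the idealized gain while using nearly 2000 times less SRAM than the idealized tag store.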
Where Pith is reading between the lines
- The dirtiness-bit propagation idea could be applied to other cache metadata that must stay consistent across hierarchy levels.
- Prediction accuracy for preemptive marking may improve if it incorporates program-counter history rather than a simple heuristic.
- The approach may extend to other non-volatile memories that also exhibit read/write asymmetry and high access latency.
Load-bearing premise
The bandwidth saved by pruning repeated dirty-bit checks and amortizing the initial dirty-bit update will exceed any new overhead introduced by the extra bit in the last-level cache and by the prediction logic.
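The two sides of this premise can be made concrete with a small accounting sketch for Preemptive Dirty Marking. It is illustrative only: the abstract does not describe the predictor, so predict_dirty is a hypothetical stand-in, and the cost of a wrong prediction (a clean line treated as dirty and written back needlessly on eviction) is an assumption rather than something the paper states.

```python
def dirty_bit_costs(lines, predict_dirty):
    """Count metadata costs for a batch of installed lines.

    lines: iterable of (line_id, will_be_written) pairs; the flag is ground truth.
    predict_dirty: callable line_id -> bool, a hypothetical stand-in for the predictor.
    Returns (separate_updates, false_dirty).
    """
    separate_updates = 0  # dirty-bit updates that still happen after install
    false_dirty = 0       # clean lines marked dirty at install (assumed extra writeback)
    for line_id, will_be_written in lines:
        marked_at_install = predict_dirty(line_id)
        if will_be_written and not marked_at_install:
            separate_updates += 1
        elif marked_at_install and not will_be_written:
            false_dirty += 1
    return separate_updates, false_dirty


lines = [(0, True), (1, True), (2, False), (3, False)]
print(dirty_bit_costs(lines, lambda line_id: False))           # no preemptive marking
print(dirty_bit_costs(lines, lambda line_id: line_id in {0, 1, 2}))  # over-eager predictor
```

The premise holds only when the drop in the first count outweighs the cost attached to the second, which is the breakdown the referee report asks the authors to quantify.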
What would settle it
Run the paper's workloads on a simulator with the dirtiness bit and preemptive marking disabled and then enabled, and measure whether DRAM cache bandwidth consumption falls and whether the overall speedup over baseline TIC matches the reported 10 percent.
Original abstract
This paper investigates bandwidth-efficient DRAM caching for hybrid DRAM + 3D-XPoint memories. 3D-XPoint is becoming a viable alternative to DRAM as it enables high-capacity and non-volatile main memory systems; however, 3D-XPoint has 4-8x slower read, and worse writes. As such, effective DRAM caching in front of 3D-XPoint is important to enable a high-capacity, low-latency, and high-write-bandwidth memory. There are two major approaches for DRAM cache design: (1) a Tag-Inside-Cacheline (TIC) organization that optimizes for hits, by storing tag next to each line such that one access gets both tag and data, and (2) a Tag-Outside-Cacheline (TOC) organization that optimizes for misses, by storing tags from multiple data-lines together such that one tag-access gets info for several data-lines. Ideally, we desire the low hit-latency of TIC, and the low miss-bandwidth of TOC. To this end, we propose TicToc, an organization that provisions both TIC and TOC to get hit and miss benefits of both. However, we find that naively combining both actually performs worse than TIC, because one needs to pay bandwidth to maintain both metadata. The main contribution of this work is developing architectural techniques to reduce the bandwidth of maintaining both TIC and TOC metadata. We find the majority of the bandwidth cost is due to maintaining TOC dirty bits. We propose DRAM Cache Dirtiness Bit, which carries DRAM cache dirty info to last-level caches, to prune repeated dirty-bit checks for known dirty lines. We then propose Preemptive Dirty Marking, which predicts which lines will be written and proactively marks dirty bit at install time, to amortize the initial dirty-bit update. Our evaluations on a 4GB DRAM cache with 3D-XPoint memory show that TicToc enables 10% speedup over baseline TIC, nearing 14% speedup possible with an idealized DRAM cache w/ 64MB of SRAM tags, while needing only 34KB SRAM.
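The TIC/TOC trade-off described in the abstract can be illustrated with a rough access-count sketch. Several details are assumptions made for illustration and do not come from the paper: one tag access covers 8 consecutive lines (LINES_PER_TAG_ACCESS), the most recently fetched tag group is buffered, and a TOC hit needs a separate data access. The function names are hypothetical.

```python
LINES_PER_TAG_ACCESS = 8  # assumed for illustration; not a figure from the paper


def tic_accesses(lookups):
    """TIC: tag and data arrive together, so every lookup costs one DRAM-cache access."""
    return len(lookups)


def toc_accesses(lookups):
    """TOC: one tag access covers a group of lines; hits need an extra data access."""
    accesses = 0
    buffered_group = None  # assumption: tags of the last fetched group stay buffered
    for line, is_hit in lookups:
        group = line // LINES_PER_TAG_ACCESS
        if group != buffered_group:
            accesses += 1  # fetch the tags for the whole group
            buffered_group = group
        if is_hit:
            accesses += 1  # separate access for the data line
    return accesses


misses = [(line, False) for line in range(8)]
hits = [(line, True) for line in range(8)]
print(tic_accesses(misses), toc_accesses(misses))
print(tic_accesses(hits), toc_accesses(hits))
```

On the all-miss stream the TOC model needs far fewer accesses, while on the all-hit stream the TIC model is cheaper. This asymmetry is what motivates provisioning both organizations.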
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes TicToc, a DRAM cache design for hybrid DRAM + 3D-XPoint systems that combines Tag-Inside-Cacheline (TIC) and Tag-Outside-Cacheline (TOC) organizations to achieve both low hit latency and low miss bandwidth. It shows that a naive combination performs worse than TIC alone due to extra metadata maintenance bandwidth, primarily from TOC dirty bits. The main contributions are two techniques—DRAM Cache Dirtiness Bit (which propagates dirty state to the LLC) and Preemptive Dirty Marking (a predictor that marks lines dirty at install time)—to reduce this cost. On a 4GB DRAM cache, TicToc delivers 10% speedup over baseline TIC (approaching the 14% of an idealized 64MB-SRAM-tag cache) while using only 34KB SRAM.
Significance. If the net bandwidth savings from the proposed techniques are confirmed to outweigh their added LLC and predictor overheads, the result would be a practical advance in hybrid memory caching: it closes most of the gap to an idealized tag store without requiring large on-chip SRAM. The work directly addresses a well-known TIC/TOC trade-off and supplies concrete, low-overhead mechanisms that could be adopted in future non-volatile memory controllers.
major comments (2)
- [evaluation / techniques section] §4 (or the evaluation section describing the two techniques): the central 10% speedup claim rests on the assertion that DRAM Cache Dirtiness Bit plus Preemptive Dirty Marking produce net bandwidth savings that exceed the extra TOC dirty-bit traffic plus any new LLC bandwidth or predictor-misprediction writes. No table or figure quantifies the incremental LLC-to-memory traffic or misprediction-induced writes against the reported savings; without this breakdown the headline delta cannot be verified.
- [results / idealized baseline] Table or figure reporting the 10% and 14% speedups: the comparison to the idealized DRAM cache assumes 64 MB SRAM tags, yet the paper does not state whether the idealized model also includes the same LLC and memory-controller constraints that TicToc must satisfy; this makes the proximity claim difficult to interpret.
minor comments (2)
- [abstract] Abstract states performance numbers but supplies no workload names, simulation parameters, or error bars; the full manuscript should make these explicit in the evaluation section.
- [techniques description] The description of Preemptive Dirty Marking does not specify the predictor structure or training method; a short paragraph or diagram would clarify the 34 KB SRAM budget allocation.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve clarity and verifiability of the results.
- Referee: [evaluation / techniques section] §4 (or the evaluation section describing the two techniques): the central 10% speedup claim rests on the assertion that DRAM Cache Dirtiness Bit plus Preemptive Dirty Marking produce net bandwidth savings that exceed the extra TOC dirty-bit traffic plus any new LLC bandwidth or predictor-misprediction writes. No table or figure quantifies the incremental LLC-to-memory traffic or misprediction-induced writes against the reported savings; without this breakdown the headline delta cannot be verified.
  Authors: We agree that an explicit breakdown would strengthen the paper. The current results demonstrate the net performance benefit through end-to-end simulation, but do not isolate the incremental LLC-to-DRAM and misprediction traffic components. In revision we will add a new figure (or table) in Section 4 that reports these incremental bandwidth costs for each technique relative to the baseline TIC organization, allowing direct verification that the savings exceed the added overheads.
  Revision: yes
- Referee: [results / idealized baseline] Table or figure reporting the 10% and 14% speedups: the comparison to the idealized DRAM cache assumes 64 MB SRAM tags, yet the paper does not state whether the idealized model also includes the same LLC and memory-controller constraints that TicToc must satisfy; this makes the proximity claim difficult to interpret.
  Authors: The idealized DRAM cache (64 MB SRAM tags) is modeled under the same system constraints as TicToc, including the same LLC size, replacement policy, memory controller, and 3D-XPoint timing parameters; the only difference is that its tags are held in SRAM. We will revise the text in the results section and the caption to state these modeling assumptions explicitly so the 14% figure is directly comparable.
  Revision: yes
Circularity Check
No significant circularity; simulation-based architectural proposal
The paper proposes TicToc as an architectural organization for DRAM caches and evaluates performance via simulation on a 4GB DRAM cache setup. No equations, fitted parameters, or derivation chains exist that could reduce claims to inputs by construction. Central results (10% speedup) are reported from direct simulation comparisons to TIC baseline and idealized cases, not from self-definitional metadata or self-citation load-bearing steps. This matches the default expectation of no circularity for non-derivational work.