pith. machine review for the scientific record.

arxiv: 2605.08908 · v2 · submitted 2026-05-09 · 💻 cs.AR

Recognition: 2 Lean theorem links

HyDRA: Deadline and Reuse-Aware Cacheability for Hardware Accelerators

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:20 UTC · model grok-4.3

classification 💻 cs.AR
keywords hardware accelerators · shared cache · reuse prediction · deadline awareness · cache bypassing · heterogeneous SoCs · performance optimization · clustering predictor

The pith

HyDRA uses a clustering predictor to balance accelerator deadline constraints with reuse-aware bypassing in shared caches of heterogeneous SoCs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses a tension in heterogeneous SoCs: accelerators' strict deadlines can degrade core performance when both share a system-level cache. It proposes LERN, a clustering-based method that learns and predicts accelerator reuse patterns at the shared cache, where accelerator reuse behavior differs from that of cores because of architectural differences. HyDRA then applies these predictions to cache-bypass decisions, trading reuse awareness against deadline awareness to maximize overall system throughput. The strategy adjusts decisions dynamically to meet accelerator timing requirements while reducing unnecessary cache pollution. Evaluations across varied workloads and accelerator setups show performance gains alongside lower deadline miss rates.
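
One plausible reading of that tradeoff, as a minimal sketch. The function name, the progress/margin formulation, and the threshold are assumptions for illustration, not the paper's actual policy:

    # Illustrative sketch of a deadline- and reuse-aware bypass decision.
    # All names and the margin formulation are assumptions; HyDRA's actual
    # mechanism may differ.

    def should_bypass(predicted_reuse: bool,
                      progress: float,   # fraction of accelerator work completed
                      elapsed: float,    # fraction of the deadline period elapsed
                      margin: float = 0.05) -> bool:
        """Decide whether an accelerator access bypasses the shared cache."""
        if progress + margin < elapsed:
            # Behind schedule: deadline awareness wins. Keep accelerator
            # lines in the cache so the accelerator catches up.
            return False
        # On or ahead of schedule: reuse awareness wins. Bypass lines the
        # predictor marks as non-reusable, freeing capacity for the cores.
        return not predicted_reuse

    print(should_bypass(predicted_reuse=False, progress=0.6, elapsed=0.5))  # True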

Core claim

HyDRA is a deadline and reuse-aware cache management strategy that employs the LERN clustering-based predictor to dynamically predict the reuse behavior of accelerator accesses at the shared cache and make bypass decisions that maximize system throughput while meeting accelerator deadlines.

What carries the argument

LERN, a clustering-based methodology for learning and predicting the reuse behavior of hardware accelerators at the shared cache level, which drives HyDRA's bypass decisions.
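
A minimal sketch of what a clustering-based reuse predictor of this kind could look like, assuming k-means over per-region reuse-interval (RI) features and a reuse-count threshold for labeling clusters. The paper's figures mention 4-D RI features and clusters selected for bypass by reuse, but the feature layout, k=4, and the threshold below are assumptions, not the paper's training flow:

    # Sketch of a LERN-style predictor: cluster accesses by RI features,
    # then mark low-reuse clusters for bypass. Data is synthetic.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    features = rng.random((1000, 4))          # 4-D RI feature vectors (synthetic)
    reuse_counts = rng.poisson(2.0, size=1000)  # observed reuse per region (synthetic)

    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)

    # A cluster whose mean observed reuse falls below a threshold is marked
    # for bypass; its members are predicted non-reusable at the shared cache.
    REUSE_THRESHOLD = 1.0
    cluster_mean_reuse = np.array(
        [reuse_counts[kmeans.labels_ == c].mean() for c in range(4)])
    bypass_clusters = set(np.where(cluster_mean_reuse < REUSE_THRESHOLD)[0])

    def predict_reuse(feature_vec: np.ndarray) -> bool:
        """Predict whether an access with these RI features will be reused."""
        cluster = kmeans.predict(feature_vec.reshape(1, -1))[0]
        return cluster not in bypass_clusters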

Load-bearing premise

The clustering-based LERN predictor accurately captures accelerator reuse behavior at the shared cache level and enables effective bypass decisions without violating deadlines.

What would settle it

Measure LERN prediction accuracy against observed reuse patterns in a real or simulated heterogeneous SoC; if bypass decisions increase deadline misses or yield no throughput improvement over baseline reuse predictors, the central claim fails.
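
A sketch of the two measurements that test would need, under an assumed trace format (the field names and epoch layout are hypothetical):

    # Settling test: prediction accuracy vs. a baseline predictor, plus the
    # accelerator deadline miss rate. Trace and epoch formats are assumed.

    def prediction_accuracy(trace, predictor) -> float:
        """trace: iterable of (features, observed_reuse) pairs."""
        results = [predictor(f) == observed for f, observed in trace]
        return sum(results) / len(results)

    def deadline_miss_rate(epochs) -> float:
        """epochs: iterable of (work_done, work_required) per deadline epoch."""
        misses = sum(done < required for done, required in epochs)
        return misses / len(epochs)

    # The claim fails if prediction_accuracy for LERN offers no gain over a
    # baseline reuse predictor, or if deadline_miss_rate rises under HyDRA's
    # bypass decisions relative to the no-bypass baseline.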

Figures

Figures reproduced from arXiv: 2605.08908 by Anannya Mathur, Ayushi Agarwal, Preeti Ranjan Panda.

Figure 1: Overview of Heterogeneous System Architecture. …
Figure 2: Key Motivational Challenges 1, 2, and 3. Mean per…
Figure 3: Limitations of the state-of-the-art cache management…
Figure 4: LERN Methodology Overview: Reuse Count and Reuse…
Figure 5: PCA projection in 2-D for the 4-D RI features…
Figure 6: Percentage of memory accesses clustered in different…
Figure 7: HyDRA’s Deadline with Reuse-Aware Bypass and…
Figure 8: Margin Requirement Estimation based on the accelerator…
Figure 9: Clusters to be bypassed determined by the reuse…
Figure 10: Performance Evaluation of HyDRA (ARP-CS-AL-D) on Accelerator Config-1. Deadline: 10 IPS.
Figure 11: Accelerator’s shared cache access rate during the execution compared with the per-epoch progress requirement based…
Figure 12: Performance Evaluation of HyDRA across Accelerator Configurations Config-1 to Config-10. Deadline: 10 IPS.
Figure 13: Performance Comparison of HyDRA with FIFO-NB and ARP-CS-AS-D. Deadline: 10 IPS.
Figure 14: Comparison of the shared cache space occupied by all cores (-C) and the accelerator (-A) during the execution with…
Figure 15: Performance achieved by policies with similar accelerator bypass rates as HyDRA. Deadline: 10 IPS.
Figure 16: Performance Evaluation with varying LLC capacity.
Figure 18: Performance evaluation with 2-way cache partitioning…
Figure 19: Performance evaluation of HyDRA with different LERN predictor table entries. LERN is trained on hashed addresses.
Figure 20: Performance evaluation of HyDRA over FIFO-NB and SHIP-driven bypass with different SHIP predictor table size…
read the original abstract

The system-level cache is a critical resource shared by processor cores and domain-specific accelerators in heterogeneous systems on chips (SoCs). The strict QoS requirements of accelerators, such as deadlines, can lead to severe performance degradation of processor cores. Thus, managing the shared cache efficiently between cores and accelerators becomes crucial. State-of-the-art cache management techniques perform reuse-aware bypassing of accesses from cores with the help of reuse predictors to improve performance. However, architectural differences between accelerators and processor cores (often associated with deep cache hierarchies) can lead to significantly different reuse patterns at the shared cache. We propose a novel clustering-based methodology, LERN, for learning and predicting the reuse behavior of hardware accelerators at the shared cache. We then propose a deadline and reuse-aware cache management strategy, HyDRA, which explores a novel tradeoff between reuse and deadline awareness for performance efficiency. It uses LERN to dynamically predict the reuse behavior of the accelerator accesses and make bypass decisions to maximize the system throughput while meeting accelerator deadlines. We evaluate HyDRA across different workloads and varied accelerator configurations. It significantly improves the system performance and reduces the accelerator deadline miss rate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes LERN, a clustering-based methodology to learn and predict the reuse behavior of hardware accelerators at the shared last-level cache in heterogeneous SoCs, and HyDRA, a deadline- and reuse-aware cache management policy that uses LERN predictions to make dynamic bypass decisions. The central claim is that HyDRA improves overall system throughput while meeting accelerator deadlines, with evaluation across workloads and accelerator configurations showing significant performance gains and reduced deadline miss rates.

Significance. If the central claims hold with robust validation, the work would be significant for cache management in heterogeneous SoCs, where accelerators impose strict QoS constraints that conflict with CPU performance. The clustering approach tailored to accelerator reuse patterns (distinct from CPU patterns) and the explicit tradeoff between reuse awareness and deadline compliance represent a targeted contribution beyond standard reuse predictors.

major comments (2)
  1. [Evaluation] Evaluation section: The abstract and manuscript state that HyDRA 'significantly improves the system performance and reduces the accelerator deadline miss rate' across workloads and configurations, yet supply no quantitative results (e.g., speedup percentages, miss-rate deltas), error bars, workload characteristics, accelerator configurations, or methodology details for deadline enforcement and measurement. This prevents assessment of whether the gains are load-bearing or artifacts of the setup.
  2. [LERN and Evaluation] LERN methodology and evaluation: The claim that LERN's clustering reliably identifies reusable vs. non-reusable accelerator lines at the LLC (enabling effective bypass without deadline violations) lacks sensitivity analysis on cluster count k, feature selection, distance metric, or cross-configuration validation (e.g., training on one accelerator type and testing on another). Given bursty/streaming accelerator patterns, this is load-bearing for the robustness of HyDRA's bypass decisions.
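
A minimal sketch of the sensitivity sweep requested in major comment 2, on synthetic data (the features, the silhouette score, and the range of k are illustrative assumptions, not the paper's setup):

    # Sweep the cluster count k and score each clustering; synthetic 4-D
    # reuse-interval features stand in for real profiling data.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    features = np.random.default_rng(1).random((1000, 4))

    for k in range(2, 9):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        print(f"k={k}: silhouette={silhouette_score(features, labels):.3f}")
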
minor comments (2)
  1. [Introduction] The title uses 'Cacheability' but the abstract and text focus on bypass decisions; a brief clarification of the term in the introduction would improve precision.
  2. [HyDRA] Notation for reuse predictors and deadline metrics could be standardized earlier (e.g., define all symbols before first use in the HyDRA policy description).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to strengthen the evaluation and analysis as requested.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: The abstract and manuscript state that HyDRA 'significantly improves the system performance and reduces the accelerator deadline miss rate' across workloads and configurations, yet supply no quantitative results (e.g., speedup percentages, miss-rate deltas), error bars, workload characteristics, accelerator configurations, or methodology details for deadline enforcement and measurement. This prevents assessment of whether the gains are load-bearing or artifacts of the setup.

    Authors: We agree that quantitative details are necessary for rigorous assessment. The revised manuscript now includes specific performance speedups (with percentages), deadline miss rate reductions, error bars from repeated simulations, workload characteristics, accelerator configurations, and full methodology for deadline enforcement and measurement. These additions substantiate the claims and allow evaluation of whether gains are robust. revision: yes

  2. Referee: [LERN and Evaluation] LERN methodology and evaluation: The claim that LERN's clustering reliably identifies reusable vs. non-reusable accelerator lines at the LLC (enabling effective bypass without deadline violations) lacks sensitivity analysis on cluster count k, feature selection, distance metric, or cross-configuration validation (e.g., training on one accelerator type and testing on another). Given bursty/streaming accelerator patterns, this is load-bearing for the robustness of HyDRA's bypass decisions.

    Authors: We acknowledge the value of sensitivity analysis for validating LERN's clustering robustness, particularly for bursty accelerator patterns. The original submission emphasized end-to-end HyDRA results but omitted detailed parameter studies. The revision adds sensitivity analysis on cluster count k, feature selection, and distance metrics, along with cross-configuration validation (training on one accelerator type and testing on others) to confirm that LERN reliably distinguishes reusable lines without compromising deadline compliance. revision: yes
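
A sketch of the cross-configuration protocol described in this response, on synthetic data (the cluster labeling and data layout are assumptions):

    # Fit the clustering on traces from one accelerator configuration and
    # test the derived bypass labels on another; all data here is synthetic.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(2)
    train_feats = rng.random((1000, 4))          # RI features from config A
    test_feats = rng.random((500, 4))            # RI features from config B
    test_reuse = rng.poisson(2.0, size=500) > 0  # observed reuse on config B

    km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(train_feats)
    bypass_clusters = {0}  # assume cluster 0 was labeled non-reusable on config A

    pred_reuse = np.array([c not in bypass_clusters for c in km.predict(test_feats)])
    print("cross-config accuracy:", (pred_reuse == test_reuse).mean())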

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces LERN as a novel clustering-based predictor for accelerator reuse at the shared cache and HyDRA as a deadline-aware bypass policy built on top of it. Both are presented as new proposals and evaluated empirically across workloads and accelerator configurations. No equations reduce fitted parameters to predictions by construction, no self-citations serve as load-bearing justifications for uniqueness or ansatzes, and the central claims rest on simulation results rather than self-referential definitions or renamings of known results. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard computer-architecture assumptions about cache behavior and workload characteristics; no free parameters or invented physical entities are visible in the abstract.

axioms (1)
  • domain assumption Accelerators exhibit significantly different reuse patterns from processor cores at the shared cache due to architectural differences.
    Explicitly stated in the abstract as the motivation for a new predictor.

pith-pipeline@v0.9.0 · 5503 in / 1131 out tokens · 44217 ms · 2026-05-13T07:20:10.222574+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

[1] M. Subramony, D. Kramer, and I. Paul, “AMD Ryzen 7040 Series,” IEEE Micro, vol. 44, no. 3, pp. 18–24, May 2024.
[2] Qualcomm. (2020) Snapdragon 888 5G Mobile Platform. [Online]. Available: www.qualcomm.com/products/snapdragon-888-5g-mobile-platform
[3] HiSilicon. (2020) Kirin 9000. [Online]. Available: https://www.hisilicon.com/en/products/Kirin/Kirin-flagship-chips/Kirin-9000
[4] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, et al., “The gem5 simulator,” SIGARCH Comput. Archit. News, vol. 39, no. 2, pp. 1–7, Aug. 2011.
[5] A. Agarwal, P. Goel, P. Joseph, P. Ghosh, S. Roy, and P. R. Panda, “FLASH: Deadline-Aware Flexible LLC Arbitration and Scheduling for Hardware Accelerators,” ACM Trans. Embed. Comput. Syst., vol. 24, no. 6, Oct. 2025.
[6] M. K. Qureshi and Y. N. Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” in 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06), 2006, pp. 423–432.
[7] Intel. (2015) White Paper — Improving Real-Time Performance by Utilizing Cache Allocation Technology — Enhancing Performance via Allocation of the Processor’s Cache. [Online]. Available: https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/cache-allocation-technology-white-paper.pdf
[8] J. Fang, Z. Nie, and L. Zhao, “PACP: A Prefetch-aware Multi-core Shared Cache Partitioning Strategy,” in Proceedings of the 8th International Conference on Computing and Artificial Intelligence (ICCAI ’22), New York, NY, USA: ACM, 2022, pp. 246–251.
[9] J. Park, H. Yeom, and Y. Son, “Page Reusability-Based Cache Partitioning for Multi-Core Systems,” IEEE Transactions on Computers, vol. 69, no. 6, pp. 812–818, 2020.
[10] H. Wen and W. Zhang, “Heterogeneous Cache Hierarchy Management for Integrated CPU-GPU Architecture,” in IEEE High Performance Extreme Computing Conference (HPEC), 2019, pp. 1–6.
[11] J. Lee and H. Kim, “TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture,” in IEEE International Symposium on High-Performance Computer Architecture, 2012, pp. 1–12.
[12] J. Feliu, J. Sahuquillo, S. Petit, and J. Duato, “Perf&Fair: A Progress-Aware Scheduler to Enhance Performance and Fairness in SMT Multicores,” IEEE Transactions on Computers, vol. 66, no. 5, pp. 905–911, 2017.
[13] S. Tiwari, S. Tuli, I. Ahmad, A. Agarwal, P. R. Panda, and S. Subramoney, “REAL: REquest Arbitration in Last Level Caches,” ACM Trans. Embed. Comput. Syst., vol. 18, no. 6, Nov. 2019.
[14] L. Li, D. Tong, Z. Xie, J. Lu, and X. Cheng, “Optimal bypass monitor for high performance last-level caches,” in Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT ’12), 2012, pp. 315–324.
[15] C.-J. Wu, A. Jaleel, W. Hasenplaugh, M. Martonosi, S. C. Steely, and J. Emer, “SHiP: Signature-based Hit Predictor for high performance caching,” in 2011 44th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2011, pp. 430–441.
[16] W. Lu, G. Yan, J. Li, S. Gong, Y. Han, and X. Li, “FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks,” in 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017, pp. 553–564.
[17] Y.-H. Chen, T. Krishna, J. S. Emer, and V. Sze, “Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks,” IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[18] R. Li, S. Ma, K. Kavi, G. Mehta, N. J. Yadwadkar, and L. K. John, “CADOSys: Cache Aware Design Space Optimization for Spatial ML Accelerators,” in Proceedings of the Great Lakes Symposium on VLSI 2025 (GLSVLSI ’25), 2025, pp. 200–207.
[19] A. Jaleel, K. B. Theobald, S. C. Steely, and J. Emer, “High performance cache replacement using re-reference interval prediction (RRIP),” in Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA ’10), New York, NY, USA: ACM, 2010, pp. 60–71.
[20] D. Lee, J. Choi, J.-H. Kim, S. Noh, S. L. Min, Y. Cho, and C. S. Kim, “LRFU: a spectrum of policies that subsumes the least recently used and least frequently used policies,” IEEE Transactions on Computers, vol. 50, no. 12, pp. 1352–1361, 2001.
[21] J. J. K. Park, Y. Park, and S. Mahlke, “A bypass first policy for energy-efficient last level caches,” in 2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS), 2016, pp. 63–70.
[22] S. M. Khan, Y. Tian, and D. A. Jiménez, “Sampling Dead Block Prediction for Last-Level Caches,” in 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010, pp. 175–186.
[23] Intel Corporation. (2025, May) Intel® 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes 2A, 2B, 2C, and 2D: Instruction Set Reference, A-Z. [Online]. Available: https://cdrdv2.intel.com/v1/dl/getContent/671110
[24] Z. Wang, C. Fu, and J. Han, “Coupled data prefetch and cache partitioning scheme for cpu-accelerator system,” in 2023 IEEE 15th International Conference on ASIC (ASICON), 2023, pp. 1–4, doi: 10.1109/ASICON58565.2023.10396658.
[25] P. Li, Y. Guo, and Y. Gu, “Predicting Reuse Interval for Optimized Web Caching: An LSTM-Based Machine Learning Approach,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022, pp. 1–15.
[26] D. A. Jiménez and E. Teran, “Multiperspective reuse prediction,” in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-50), New York, NY, USA: ACM, 2017, pp. 436–448.
[27] E. Teran, Z. Wang, and D. A. Jiménez, “Perceptron learning for reuse prediction,” in 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2016, pp. 1–12.
[28]–[29] Z. Shi, X. Huang, A. Jain, and C. Lin, “Applying deep learning to the cache replacement problem,” in Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-52), New York, NY, USA: ACM, 2019, pp. 413–425.
[30] S. Sethumurugan, J. Yin, and J. Sartori, “Designing a Cost-Effective Cache Replacement Policy using Machine Learning,” in IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2021.
[31] A. Jain and C. Lin, “Back to the Future: Leveraging Belady’s Algorithm for Improved Cache Replacement,” in ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 78–89.
[32] I. Shah, A. Jain, and C. Lin, “Effective Mimicry of Belady’s MIN Policy,” in 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2022, pp. 558–572.
[33] S. Mostofi, S. Gupta, A. Hassani, K. Tibrewala, E. Teran, P. V. Gratz, and D. A. Jiménez, “Light-weight Cache Replacement for Instruction Heavy Workloads,” in Proceedings of the 52nd Annual International Symposium on Computer Architecture (ISCA ’25), New York, NY, USA: ACM, 2025, pp. 1005–1019.
[34] E. Z. Liu, M. Hashemi, K. Swersky, P. Ranganathan, and J. Ahn, “An imitation learning approach for cache replacement,” in Proceedings of the 37th International Conference on Machine Learning (ICML ’20), JMLR.org, 2020.
[35] H. J. Yoo, J. H. Kim, and T. H. Han, “RL-Based Cache Replacement: A Modern Interpretation of Belady’s Algorithm With Bypass Mechanism and Access Type Analysis,” IEEE Access, vol. 11, pp. 145238–145253, 2023.
[36] J. Kim, E. Teran, P. V. Gratz, D. A. Jiménez, S. H. Pugsley, and C. Wilkerson, “Kill the Program Counter: Reconstructing Program Behavior in the Processor Cache Hierarchy,” SIGARCH Comput. Archit. News, vol. 45, no. 1, pp. 737–749, Apr. 2017. [Online]. Available: https://doi.org/10.1145/3093337.3037701
[37] A. Sridharan and A. Seznec, “Discrete Cache Insertion Policies for Shared Last Level Cache Management on Large Multicores,” in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016, pp. 822–831.
[38] M. Hashemi, K. Swersky, J. Smith, G. Ayers, H. Litz, J. Chang, C. Kozyrakis, and P. Ranganathan, “Learning memory access patterns,” in Proceedings of the 35th International Conference on Machine Learning, vol. 80, PMLR, 10–15 Jul 2018, pp. 1919–1928.
[39] N. P. Jouppi et al., “In-Datacenter Performance Analysis of a Tensor Processing Unit,” in Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA ’17), New York, NY, USA: ACM, 2017, pp. 1–12.
[40] A. Samajdar, J. M. Joseph, Y. Zhu, P. Whatmough, M. Mattina, and T. Krishna, “A Systematic Methodology for Characterizing Scalability of DNN Accelerators using SCALE-Sim,” in 2020 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2020, pp. 58–68.
[41] Qualcomm. (2023) Kryo 585 application processor. [Online]. Available: https://docs.qualcomm.com/doc/80-PV086-5P/topic/processor.html
[42] A. Cabrera, S. Hitefield, J. Kim, S. Lee, N. R. Miniskar, and J. S. Vetter, “Toward Performance Portable Programming for Heterogeneous Systems on a Chip: A Case Study with Qualcomm Snapdragon SoC,” in 2021 IEEE High Performance Extreme Computing Conference (HPEC), 2021, pp. 1–7.
[43] Qualcomm. (2023) Qualcomm Robotics RB5 Development Kit: Processor Data Sheet. [Online]. Available: https://docs.qualcomm.com/doc/80-PV086-1/topic/80-PV086-1 REV E QRB5165 Data Sheet.pdf?product=1601111740013082
[44] Qualcomm. (2024) Qualcomm QCS8250 Processor. [Online]. Available: https://www.qualcomm.com/content/dam/qcomm-martech/dm-assets/documents/qcs8250-soc-product-brief 87-pu792-1-c.pdf
[45] M. Ditty, “NVIDIA ORIN System-On-Chip,” in 2022 IEEE Hot Chips 34 Symposium (HCS), 2022, pp. 1–17.
[46] L. James. (2025) MediaTek Goes All In on First All Big Core Chip for Smartphones. [Online]. Available: https://www.allaboutcircuits.com/news/mediatek-goes-all-in-on-first-all-big-core-chip-for-smartphones/
[47] G. L. Steele, D. Lea, and C. H. Flood, “Fast splittable pseudorandom number generators,” SIGPLAN Not., vol. 49, no. 10, pp. 453–472, 2014.
[48] NXP Semiconductors. (2024) Layerscape 2088A and 2048A Processors. [Online]. Available: https://www.nxp.com/products/LS2088A