Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning

Andreas Kakolyris; Gagandeep Singh; Ismail Emir Yuksel; Jisung Park; Mohammad Sadrosadati; Onur Mutlu; Rahul Bera; Rakesh Nadig; Taha Shahroodi; Vamanan Arulchelvan

read the original abstract

Modern high-performance computing (HPC) environments rely on hybrid storage systems (HSS) that combine multiple storage devices with diverse latency, bandwidth, endurance, and capacity characteristics to meet the performance, capacity, and cost requirements of data-intensive applications. The performance of an HSS highly depends on two key data-management policies: (1) data placement, which determines the most suitable storage device to store application data, and (2) data migration, which dynamically reorganizes previously-stored data across storage devices (i.e., prefetching hot data and evicting cold data) to sustain high HSS performance. These policies are tightly interdependent, and thus, improving one without considering the other leads to suboptimal HSS performance. Unfortunately, prior works focus on optimizing only one of the policies. Our goal is to design a holistic data-management technique that optimizes both data-placement and data-migration policies to fully exploit the potential of an HSS. To this end, we propose Harmonia, a multi-agent reinforcement learning (RL)-based data-management technique. Harmonia employs two lightweight autonomous RL agents, a data-placement agent and a data-migration agent, that adapt their policies for the current workload and HSS configuration while coordinating with each other. We evaluate Harmonia on real HSS configurations with up to four heterogeneous storage devices and 25 data-intensive workloads. On a performance- (cost-) optimized HSS with two heterogeneous storage devices, Harmonia outperforms the best-performing prior approach by 29.3% (44.8%) on average. On an HSS with three (four) devices, Harmonia outperforms the best-performing prior work by 38.9% (39.2%) on average. Harmonia's performance benefits come with low latency (240 ns for inference) and storage (206 KiB in DRAM for both RL agents combined) overheads.

Harmonia: Enhancing Data Placement and Migration in Hybrid Storage Systems via Multi-Agent Reinforcement Learning

discussion (0)