Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM
Bitwise operations are an important component of modern-day programming. Many widely used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available memory bandwidth. We propose Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip. Buddy consists of two components. First, simultaneous activation of three DRAM rows that are connected to the same set of sense amplifiers enables us to perform bitwise AND and OR operations. Second, the inverters present in each sense amplifier enable us to perform bitwise NOT operations, with modest changes to the DRAM array. These two components make Buddy functionally complete. Our implementation of Buddy largely exploits the existing DRAM structure and interface, and incurs low overhead (1% of DRAM chip area). Our evaluations based on SPICE simulations show that, across seven commonly used bitwise operations, Buddy provides a 10.9X to 25.6X improvement in raw throughput and a 25.1X to 59.5X reduction in energy consumption. We evaluate three real-world data-intensive applications that exploit bitwise operations: 1) bitmap indices, 2) BitWeaving, and 3) a bitvector-based implementation of sets. Our evaluations show that Buddy significantly outperforms the state of the art.
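As a rough illustration of why the two components are enough for functional completeness, the following Python sketch models the logic in software. It assumes that activating three rows together settles each bit position to the majority of the three stored values, which the abstract does not spell out; under that assumption, a control row of all zeros yields AND and a control row of all ones yields OR. The names `ROW_BITS`, `majority`, `row_and`, `row_or`, `row_not`, and `row_xor` are hypothetical, and nothing here models the analog behavior, timing, or command interface of a real DRAM chip.

```python
ROW_BITS = 16                    # hypothetical row width; real DRAM rows are far wider
MASK = (1 << ROW_BITS) - 1       # a row with every bit set


def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority of three rows (assumed outcome of triple-row activation)."""
    return (a & b) | (b & c) | (a & c)


def row_and(a: int, b: int) -> int:
    # Control row of all zeros: the majority is 1 only where both a and b are 1.
    return majority(a, b, 0)


def row_or(a: int, b: int) -> int:
    # Control row of all ones: the majority is 1 where either a or b is 1.
    return majority(a, b, MASK)


def row_not(a: int) -> int:
    # Stands in for the sense-amplifier inverter; masked to the row width.
    return ~a & MASK


def row_xor(a: int, b: int) -> int:
    # Composed only from the three primitives above, as one example of
    # building a further operation out of AND, OR, and NOT.
    return row_and(row_or(a, b), row_not(row_and(a, b)))
```

For example, `row_xor(0b1100, 0b1010)` combines one OR, two ANDs, and one NOT to produce the exclusive-or of the two rows; in hardware each primitive would act on an entire row at once rather than on a Python integer.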
Forward citations
Cited by 3 Pith papers
- Clutch: High Performance Vector-Scalar Comparison using DRAM via Chunked Temporal Coding
  Clutch accelerates vector-scalar comparisons in PuD systems via chunked temporal coding, delivering 2.9x throughput and 3.0x energy gains over prior bit-serial PuD while also mapping decision tree inference to PuD for...
- PuDGhost: Experimental Analysis of Computation Result Corruption in Processing-using-DRAM Operations on Real DRAM Chips and Implications for Future Systems
  PuDGhost causes up to 48% error in SiMRA-based PuD computations due to row and column interference, quantified on 96 real DDR4 chips with proposed mitigations like column screening and row layout changes.
- HE-PIM: Demystifying Homomorphic Operations on a Real-world Processing-in-Memory System
  Characterization of HE kernels on commercial UPMEM PIM identifies modular multiplication and per-bank capacity as dominant bottlenecks and concludes PIM becomes competitive with CPU/GPU once those are addressed.