Micro-architectural Analysis of OLAP: Limitations and Opportunities

Anastasia Ailamaki; Utku Sirin

arxiv: 1908.04718 · v1 · pith:64VGPZTRnew · submitted 2019-08-13 · 💻 cs.DB · cs.PF


Pith reviewed 2026-05-24 16:20 UTC · model grok-4.3


keywords: OLAP, micro-architectural analysis, CPU stalls, memory bandwidth utilization, column stores, performance counters, analytical workloads, instruction footprint


The pith

High-performance OLAP engines spend 25 to 82 percent of CPU cycles on stalls and underutilize multi-core CPUs and memory bandwidth due to mismatched compute and memory demands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures how modern OLAP systems behave at the level of CPU cycles, stalls, and memory bandwidth on contemporary hardware. Unlike traditional commercial OLTP systems, traditional commercial OLAP systems avoid instruction cache misses, but they still carry large instruction footprints that lengthen response times. High-performance engines execute tight instruction streams, yet they lose 25 to 82 percent of their cycles to stalls whether the queries touch data sequentially or randomly. These engines also fail to saturate the available cores or memory bandwidth because their compute and memory demands are disproportional. A reader would care because the measurements identify concrete hardware-level limits that any redesign of analytical engines must address.

Core claim

Traditional commercial OLAP systems exhibit large instruction footprints that slow response times yet do not suffer instruction cache misses. High-performance OLAP engines run tight instruction streams but still spend 25 to 82 percent of CPU cycles on stalls across both sequential- and random-access workloads. These same engines underutilize multi-core CPU resources and memory bandwidth because their compute and memory demands remain disproportional.
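
One way to read "disproportional compute and memory demands" is as a mismatch between the bandwidth a single core can drive and the number of cores the machine offers. The sketch below is an illustration under stated assumptions, not the paper's model: the function name `saturation_report`, the linear-scaling assumption, and every number are hypothetical.

```python
import math


def saturation_report(per_core_gbps: float, peak_gbps: float, cores: int) -> dict:
    """Classify a workload as compute- or bandwidth-limited (hypothetical model).

    Assumes bandwidth demand grows linearly with the number of cores used,
    which real systems only approximate.
    """
    if per_core_gbps <= 0 or peak_gbps <= 0 or cores <= 0:
        raise ValueError("all inputs must be positive")
    # Cores needed before the memory bus becomes the limit.
    cores_to_saturate = math.ceil(peak_gbps / per_core_gbps)
    # Bandwidth actually reached when every available core is busy.
    achieved = min(per_core_gbps * cores, peak_gbps)
    return {
        "cores_to_saturate": cores_to_saturate,
        "bandwidth_utilization": achieved / peak_gbps,
        # More cores needed than available: memory bandwidth stays partly idle.
        # Otherwise: bandwidth saturates first and extra cores stay idle.
        "limited_by": "compute" if cores_to_saturate > cores else "bandwidth",
    }


# Hypothetical inputs, chosen only to show both outcomes.
print(saturation_report(per_core_gbps=2.0, peak_gbps=64.0, cores=14))
print(saturation_report(per_core_gbps=8.0, peak_gbps=64.0, cores=14))
```

In the first call the cores run out before the bandwidth does; in the second the bandwidth saturates with cores to spare. Either way one resource is left underused, which is the imbalance the core claim describes.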

What carries the argument

Hardware performance counter measurements of CPU cycle breakdown, stall sources, and memory bandwidth utilization across multiple OLAP systems and query patterns.

If this is right

Analytical engines must deliberately balance compute and memory resource assignment to achieve efficient multi-core utilization.
Reducing stall cycles remains necessary even for engines that already use compact instruction streams.
Workload access pattern (sequential versus random) does not remove the stalls: the stall share stays between 25 and 82 percent for both kinds of workload.
Large instruction footprints in commercial OLAP systems continue to limit response times despite the absence of cache misses.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of future column stores could explore tighter integration of compute and memory scheduling to close the observed utilization gap.
The same measurement approach could be applied to emerging hardware such as persistent memory or accelerators to check whether the stall and bandwidth patterns persist.
Query optimizers might incorporate stall and bandwidth models as cost factors once the disproportional demand pattern is confirmed across more systems.

Load-bearing premise

The specific OLAP systems, queries, and data sets measured are representative of production analytical workloads and the performance counters accurately reflect dominant bottlenecks.

What would settle it

Repeating the measurements on additional production OLAP engines or workloads that consistently show stall fractions below 25 percent or near-full memory bandwidth utilization would falsify the central claim.

Figures

Figures reproduced from arXiv: 1908.04718 by Anastasia Ailamaki, Utku Sirin.

**Figure 1.** CPU cycles breakdown for projection as projectivity increases for DBMS R and DBMS C. Panels: CPU cycles (%) split into Stall and Retiring, and Stall cycles (%) split into Execution, Dcache, Decoding, Icache and Branch misp.; x-axis: Projectivity (p1 to p4).

**Figure 5.** Single-core sequential access bandwidth (series: DBMS R, DBMS C, Typer, Tectorwise).

**Figure 9.** CPU cycles breakdown for Typer and Tectorwise. Unlike DBMS R and C, Typer and Tectorwise have the highest stall cycles ratio at 50%

**Figure 11.** CPU cycles breakdown for join for … (axis label: Join size).

**Figure 15.** CPU cycles breakdown. Both Typer and Tectorwise have the highest Retiring cycles ratio when running Q1, whereas Typer has the lowest Retiring cycles ratio when running Q9, and Tectorwise has the lowest Retiring cycles ratio when running Q6.

**Figure 19.** Response time breakdown for Tector… (axis label: Selectivity).

**Figure 21.** Single-core bandwidth utilization for 1 me…

**Figure 23.** Normalized stall time breakdown for Tectorwise when running projection and selection queries with and without SIMD.

**Figure 25.** (Left) Normalized response time breakdown, where the response time without SIMD is taken as the base. The response time breakdown includes the stall time breakdown on the same graph.

**Figure 26.** Response time breakdown for the six prefetcher configurations for Typer when running the projection micro-benchmark with degree of four.

**Figure 27.** Figures 27 and 28 show the CPU and stall cycles breakdowns. As can be seen, both CPU and stall cycles breakdowns are similar to the single-core breakdowns. While the low-cardinality group by Q1 has the highest Retiring cycles ratio both for Typer and Tectorwise, join-intensive Q9 has the lowest Retiring cycles ratio for Typer, and highly selective filter Q6 has the lowest Retiring cycles ratio for Tectorwise. Whi…

**Figure 29.** Multi-core bandwidth utilization for Typer and Tectorwise when running the projection query with degree of four. Tectorwise can be improved by using hyper-threading. Our analysis with hyper-threading showed that the bandwidth utilization is improved by 1.3x both for Typer and Tectorwise. Hence, Tectorwise’s (together with SIMD) and Typer’s bandwidth utilizations would rise to 40 GB/s and 27 GB/s when…

Original abstract

Understanding micro-architectural behavior is profound in efficiently using hardware resources. Recent work has shown that, despite being aggressively optimized for modern hardware, in-memory online transaction processing (OLTP) systems severely underutilize their core micro-architecture resources [25]. Online analytical processing (OLAP) workloads, on the other hand, exhibit a completely different computing pattern. OLAP workloads are read-only, bandwidth-intensive and include various data access patterns including both sequential and random data accesses. In addition, with the rise of column-stores, they run on high performance engines that are tightly optimized for the efficient use of modern hardware. Hence, the micro-architectural behavior of modern OLAP systems remains unclear. This work presents the micro-architectural analysis of a breadth of OLAP systems. We examine CPU cycles and memory bandwidth utilization. The results show that, unlike the traditional, commercial OLTP systems, traditional, commercial OLAP systems do not suffer from instruction cache misses. Nevertheless, they suffer from their large instruction footprint resulting in slow response times. High performance OLAP engines execute tight instruction streams; however, they spend 25 to 82% of the CPU cycles on stalls regardless of the workload being sequential- or random-access-heavy. In addition, high performance OLAP engines underutilize the multi-core CPU or memory bandwidth resources due to their disproportional compute and memory demands. Hence, analytical processing engines should carefully assign their compute and memory resources for efficient multi-core micro-architectural utilization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper measures 25-82% stall cycles and resource imbalance in several OLAP column stores, extending prior OLTP profiling, but the stall ranges and tuning advice rest on how well the chosen engines and workloads represent real deployments.


The main point is that high-performance OLAP engines still spend a large share of cycles stalled and fail to use the available cores or memory bandwidth fully, on sequential and random workloads alike. The underutilization stems from their compute and memory demands being out of proportion. The work also shows that traditional commercial OLAP systems avoid the instruction cache misses seen in OLTP but still pay for large instruction footprints in response time.

Referee Report

2 major / 2 minor

Summary. The paper presents an empirical micro-architectural analysis of multiple OLAP systems, using hardware performance counters to measure CPU cycle stalls and memory bandwidth utilization. It makes three claims. First, unlike traditional commercial OLTP systems, traditional commercial OLAP systems do not suffer from instruction cache misses, yet their large instruction footprints still slow response times. Second, high-performance OLAP engines spend 25-82% of CPU cycles on stalls irrespective of sequential or random access patterns. Third, these engines underutilize multi-core CPUs and memory bandwidth because of disproportional compute and memory demands. The paper therefore recommends careful assignment of compute and memory resources in analytical engines.

Significance. If the measurements generalize, the work supplies concrete quantitative evidence of substantial resource underutilization in modern column-store OLAP engines, identifying opportunities for hardware-software co-design improvements in bandwidth-intensive analytical workloads. The direct use of hardware counters yields falsifiable stall-fraction observations that could guide future engine optimizations.

major comments (2)

[Experimental Setup / Workload Selection] The central claim of a 25-82% stall range 'regardless of the workload being sequential- or random-access-heavy' (abstract) rests on the representativeness of the chosen OLAP engines, queries, and datasets. The experimental corpus must be shown to capture production instruction footprints, cache behavior, and bandwidth/compute balance; otherwise the attribution to 'disproportional compute and memory demands' does not necessarily extend beyond the measured instances.
[Results / Quantitative Observations] No error bars, standard deviations, raw counter values, or repeated-run statistics accompany the reported stall fractions and utilization numbers (abstract and results). This absence prevents verification that the observed ranges are robust to measurement overhead, configuration artifacts, or post-hoc workload selection.

minor comments (2)

[Abstract and Introduction] Clarify the distinction between 'traditional, commercial OLAP systems' and 'high performance OLAP engines' throughout the text to avoid ambiguity in the contrast with OLTP.
[Methodology] Provide explicit definitions or references for the hardware performance counter events used to classify stalls and bandwidth utilization.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our empirical analysis of OLAP micro-architectural behavior. We address each major comment below, indicating planned revisions where appropriate.


Referee: [Experimental Setup / Workload Selection] The central claim of a 25-82% stall range 'regardless of the workload being sequential- or random-access-heavy' (abstract) rests on the representativeness of the chosen OLAP engines, queries, and datasets. The experimental corpus must be shown to capture production instruction footprints, cache behavior, and bandwidth/compute balance; otherwise the attribution to 'disproportional compute and memory demands' does not necessarily extend beyond the measured instances.

Authors: Our experiments cover multiple OLAP systems (both traditional commercial systems and high-performance engines) and use projection, selection and join micro-benchmarks together with a subset of TPC-H queries, chosen specifically to include both sequential scan-heavy and random access patterns. TPC-H is widely used in the database literature as a stand-in for production analytical workloads. We will revise the manuscript to add an expanded discussion of workload selection criteria, including how the chosen queries exercise the instruction footprints and cache behavior behind the measured stall ranges. Direct reproduction of proprietary production traces is not possible, but the observed 25-82% stall range holds consistently across the selected corpus.

revision: partial

Referee: [Results / Quantitative Observations] No error bars, standard deviations, raw counter values, or repeated-run statistics accompany the reported stall fractions and utilization numbers (abstract and results). This absence prevents verification that the observed ranges are robust to measurement overhead, configuration artifacts, or post-hoc workload selection.

Authors: We agree that the absence of statistical measures limits verifiability. Hardware counter readings were collected over repeated executions with stable system configurations to mitigate measurement artifacts, yet variance was not reported. In the revised version we will add error bars, standard deviations from multiple runs, and representative raw counter values for the key stall and bandwidth metrics to demonstrate robustness of the reported ranges.

revision: yes
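
The repeated-run statistics requested by the referee could be summarized along the following lines. This is a minimal sketch using Python's standard `statistics` module; the helper name `summarize_runs` and the five stall fractions are hypothetical and do not come from the paper.

```python
import statistics


def summarize_runs(stall_fractions):
    """Return mean, sample standard deviation, and range of repeated runs."""
    if len(stall_fractions) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "mean": statistics.mean(stall_fractions),
        "stdev": statistics.stdev(stall_fractions),  # sample (n-1) deviation
        "min": min(stall_fractions),
        "max": max(stall_fractions),
    }


# Hypothetical stall fractions from five repeated runs of one query.
runs = [0.52, 0.50, 0.51, 0.53, 0.49]
summary = summarize_runs(runs)
print(f"stall = {summary['mean']:.3f} +/- {summary['stdev']:.3f} "
      f"(range {summary['min']:.2f} to {summary['max']:.2f})")
```

Reporting the range next to the standard deviation makes it easier to see whether a single outlier run drives an endpoint such as 25% or 82%.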

Circularity Check

0 steps flagged

Empirical measurement study; no derivation chain or self-referential reduction


The paper reports direct hardware-counter observations (stall fractions 25-82%, bandwidth/CPU underutilization) on chosen OLAP engines and workloads. No equations, fitted parameters, or predictions are defined in terms of the reported quantities. The single citation to prior OLTP work [25] (Sirin et al., an overlapping set of authors) serves as contrast and is not load-bearing for the OLAP measurements. The work is self-contained against external benchmarks (hardware counters on specific systems) and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The analysis rests on the assumption that the chosen commercial and high-performance OLAP engines plus the tested query mix are representative; no free parameters, mathematical axioms, or invented entities are introduced because the contribution is observational profiling rather than modeling.


Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 1 internal anchor

[1]

Micro-architectural Analysis of OLAP: Limitations and Opportunities

INTRODUCTION. Online analytical processing (OLAP) is an ever-growing, multi-billion dollar industry. Many industrial and community organizations rely on fast and efficient analytical processing to extract valuable information from their data. Understanding micro-architectural behavior of OLAP systems, on the other hand, is profound in providing high pe...

[2]

SETUP & METHODOLOGY. This section presents our experimental setup and methodology. Benchmarks: We use micro-benchmarks and a subset of TPC-H queries [31]. We use projection, selection and join micro-benchmarks as they constitute the basic SQL operators. All the systems use hash join algorithm when running the join micro-benchmark. We also performed a g...

[3]

analysis type for CPU cycles breakdown. VTune’s general-exploration provide full CPU cycles breakdown [26, 32]. We examine CPU cycles at two-levels. We firstly break down the CPU cycles into Retiring and Stall cycles. Retiring cycles represent the percentage of the useful cycles spent on retiring instructions. Stall cycles represent the percentage of th...
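
The two-level breakdown described in the excerpt above (Retiring versus Stall, then Stall by source) can be sketched as follows. This is an illustration only: the stall category names are taken from the Figure 1 legend, while the function name `cycles_breakdown` and the raw cycle counts are hypothetical. A profiler such as VTune reports these shares directly, so nothing here reflects its API.

```python
# Stall categories as they appear in the Figure 1 legend.
STALL_CATEGORIES = ("Icache", "Decoding", "Branch misp.", "Dcache", "Execution")


def cycles_breakdown(retiring_cycles, stall_cycles_by_category):
    """Return (level1, level2) shares from hypothetical raw cycle counts.

    level1 splits all cycles into Retiring and Stall.
    level2 splits the Stall cycles across the stall categories.
    """
    unknown = set(stall_cycles_by_category) - set(STALL_CATEGORIES)
    if unknown:
        raise ValueError(f"unknown stall categories: {sorted(unknown)}")
    stall_total = sum(stall_cycles_by_category.values())
    total = retiring_cycles + stall_total
    if total <= 0:
        raise ValueError("total cycle count must be positive")
    level1 = {"Retiring": retiring_cycles / total, "Stall": stall_total / total}
    level2 = {
        name: (stall_cycles_by_category.get(name, 0) / stall_total
               if stall_total else 0.0)
        for name in STALL_CATEGORIES
    }
    return level1, level2


# Hypothetical counts for a single query run.
level1, level2 = cycles_breakdown(
    400, {"Dcache": 300, "Execution": 200, "Branch misp.": 100}
)
print(level1)
print(level2)
```

Keeping the second level normalized to the Stall total, rather than to all cycles, matches how the figures present a separate "Stall cycles (%)" panel.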
[4]

PROJECTION. This section presents the micro-architectural analysis of the projection micro-benchmark. Our goal is to observe how the micro-architectural behavior changes as the projectivity increases. Figure 1 shows the CPU cycles breakdown for DBMS R and C. The figure shows that while DBMS R spends about half of the CPU cycles for Retiring, DBMS C spends...

[5]

SELECTION. Having examined the projection, we now move to examining the selection micro-benchmark. Our goal is to examine how influential the branch mispredictions stalls are on the micro-architectural behavior. Figure 7 shows the CPU cycles breakdown for DBMS R and C. We observe that the Retiring cycles ratio increases as the selectivity increases bo...

[6]

JOIN. In this section, we examine the join micro-benchmark. We force all the systems to use hash join algorithm. Unlike the selection and projection micro-benchmarks with a sequential data access pattern, hash join includes many random data accesses. Our goal is to understand the effect of random data accesses in the overall micro-architectural behavior. Fi...
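
To make the contrast between sequential scans and random accesses concrete, here is a minimal build-and-probe hash join. It is a generic textbook sketch, not the implementation used by any of the measured systems; the function name, the row layout, and the sample data are hypothetical.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Join two lists of dict rows on equality of the given keys."""
    # Build phase: one sequential pass over the build side, but each insert
    # lands in a bucket chosen by the hash of the key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: one sequential pass over the probe side, but every lookup
    # jumps to a data-dependent bucket. These are the random accesses.
    out = []
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            out.append((match, row))
    return out


# Hypothetical rows.
orders = [{"o_id": 1, "cust": "a"}, {"o_id": 2, "cust": "b"}]
items = [
    {"o_id": 2, "qty": 5},
    {"o_id": 1, "qty": 7},
    {"o_id": 3, "qty": 1},
    {"o_id": 2, "qty": 9},
]
for order, item in hash_join(orders, items, "o_id", "o_id"):
    print(order["cust"], item["qty"])
```

Once the hash table outgrows the caches, each probe is likely to miss, which is why the join micro-benchmark is a natural test of random-access behavior.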
[7]

TPC-H. Up to now, we have examined simple micro-benchmarks. In this section, we analyze four TPC-H queries: Q1, Q6, Q9 and Q18, each of which represents a particular class of queries: (i) Q1 is a low-cardinality group by (4 groups), (ii) Q6 is a highly selective filter, (iii) Q9 is a join-intensive query and (iv) Q18 is a high-cardinality group by (1.5 mil- ...

[8]

PREDICATION. In this section, we examine the predication optimization. Predication is used to eliminate branches. Its idea is to convert control dependencies to data dependencies by computing the predicate as an arithmetic expression, and using it to increment the index/aggregation. The trade-off is doing more computation but avoid branches. Our goal i...
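
The transformation described in the predication excerpt, turning a control dependency into a data dependency, can be sketched as below. The function names are hypothetical. Python is used only to show the shape of the rewrite: the interpreter still branches internally, so the performance effect appears only in compiled code where the comparison becomes an arithmetic value.

```python
def select_branching(values, threshold):
    """Selection with a branch: the write happens only when the test passes."""
    out = []
    for v in values:
        if v < threshold:          # control dependency on the predicate
            out.append(v)
    return out


def select_predicated(values, threshold):
    """Selection without a data-dependent branch in the loop body."""
    out = [0] * len(values)        # preallocated output buffer
    n = 0
    for v in values:
        out[n] = v                 # always write the candidate value
        n += (v < threshold)       # predicate used as 0/1 to advance the index
    return out[:n]


def sum_predicated(values, threshold):
    """Aggregation variant: the predicate scales the value instead of guarding it."""
    total = 0
    for v in values:
        total += v * (v < threshold)
    return total


data = [7, 2, 9, 4, 1, 8]
print(select_branching(data, 5), select_predicated(data, 5), sum_predicated(data, 5))
```

The predicated versions do strictly more work per element (an unconditional write or multiply), which is the trade-off the excerpt names: more computation in exchange for no branch to mispredict.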
[9]

SIMD. The second optimization we examine is SIMD. SIMD instructions are used to reduce the number of instructions required to perform arithmetic operations. We test Tectorwise when running the projection, selection and join micro-benchmarks with and without using the SIMD instructions. As our Broadwell server does not support AVX-512 instructions,...
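
The instruction-count argument for SIMD can be made concrete with a back-of-the-envelope model. This is a rough sketch: the function name is hypothetical, the 256-bit register width is an assumption (the excerpt is truncated before it states which instruction set was used), and the model ignores loads, stores, and loop overhead.

```python
import math


def arithmetic_instructions(n_values, element_bits, register_bits=None):
    """Estimate arithmetic instructions needed to process n_values elements.

    register_bits=None models scalar code (one element per instruction).
    Otherwise each instruction handles register_bits // element_bits lanes.
    """
    if n_values < 0 or element_bits <= 0:
        raise ValueError("n_values must be >= 0 and element_bits > 0")
    if register_bits is None:
        return n_values
    lanes = register_bits // element_bits
    if lanes < 1:
        raise ValueError("register narrower than one element")
    return math.ceil(n_values / lanes)


n = 1_000_000
scalar = arithmetic_instructions(n, element_bits=32)
# Assumed 256-bit vector registers; not confirmed by the excerpt.
vector = arithmetic_instructions(n, element_bits=32, register_bits=256)
print(scalar, vector, scalar / vector)
```

A lower instruction count does not by itself shorten response time: if the loop is already waiting on memory, the saved instructions mostly turn into additional stall cycles.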
[10]

PREFETCHERS. Section 3 and 7 have shown that the projection and predicated selection queries suffer from Dcache stalls. Both the projection and predicated selection queries are essentially sequential scans of the relevant columns with a highly predictable data access pattern. Despite that, large Dcache stalls raise the question how useful hardware prefe...
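
Prefetcher configurations of the kind studied here are usually expressed as a small bitmask. The sketch below decodes such a mask, assuming the layout given in Intel's prefetcher-control disclosure (model-specific register 0x1A4, bits 0 to 3, where a set bit disables a prefetcher). That layout should be verified against Intel's document for the target processor. The function name is hypothetical, and the excerpt does not say which six configurations the paper tested.

```python
# Assumed bit layout of MSR 0x1A4 (a set bit DISABLES the prefetcher).
PREFETCHER_BITS = {
    0: "L2 hardware prefetcher",
    1: "L2 adjacent cache line prefetcher",
    2: "DCU prefetcher",
    3: "DCU IP prefetcher",
}


def enabled_prefetchers(msr_value):
    """Return the names of the prefetchers left enabled by a 4-bit mask."""
    if not 0 <= msr_value <= 0b1111:
        raise ValueError("expected a 4-bit mask")
    return [
        name
        for bit, name in PREFETCHER_BITS.items()
        if not (msr_value >> bit) & 1
    ]


for mask in (0b0000, 0b0101, 0b1111):
    print(f"{mask:04b}", enabled_prefetchers(mask))
```

The code only interprets a mask value. Reading or writing the register itself requires privileged access and is outside the scope of this sketch.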
[11]

MULTI-CORE EXECUTION. We lastly examine the hardware utilization for multi-core execution. Most OLAP operations scale well across multi-cores. As a result, we do not expect a big difference in the micro-architectural behavior of the multi-core execution compared to the single-core execution. We use the four TPC-H queries as they are more complex than the...

[12]

RELATED WORK. There is a large body of work on the micro-architectural analysis of database workloads. Ailamaki et al. [2] and Hardavellas et al. [7] present database workload characterization both for analytical and transactional workloads. Tozun et al. [29, 30] presents micro-architectural analysis of disk-based OLTP systems. Sirin et al. [25] present...

[13]

CONCLUSIONS. In this work, we evaluate the micro-architectural behavior of a breadth of OLAP systems from different categories of systems and execution models. We examine CPU cycles and memory bandwidth utilizations. The results show that, unlike traditional, commercial OLTP systems, traditional, commercial OLAP systems do not suffer from instruction cache...
[14]

D. Abadi, P. Boncz, and S. Harizopoulos. The Design and Implementation of Modern Column-Oriented Database Systems. Now Publishers Inc., 2013.

[15]

A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a Modern Processor: Where Does Time Go? VLDB, pages 266–277, 1999.

[16]

A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade. Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server. BDCloud, pages 1–8, 2015.

[17]

A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade. Micro-Architectural Characterization of Apache Spark on Batch and Stream Processing Workloads. BDCloud, pages 59–66, 2016.

[18]

P. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner. MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine. SIGMOD, pages 479–490, 2006.

[19]

M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. ASPLOS, pages 37–48, 2012.

[20]

N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, A. Ailamaki, and B. Falsafi. Database Servers on Chip Multiprocessors: Limitations and Opportunities. CIDR, pages 79–87, 2007.

[21]

S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender, and M. L. Kersten. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Engineering Bulletin, 35(1):40–45, 2012.

[22]

Intel. Disclosure of Hardware Prefetcher Control on Some Intel Processors. https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

[23]

Intel. Intel Memory Latency Checker. https://software.intel.com/en-us/articles/intelr-memory-latency-checker

[24]

Intel. Understanding How General Exploration Works in Intel VTune Amplifier, 2018. https://software.intel.com/en-us/articles/understanding-how-general-exploration-works-in-intel-vtune-amplifier-xe

[25]

Intel. Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, 2019.

[26]

C. Jonathan, U. F. Minhas, J. Hunter, J. Levandoski, and G. Nishanov. Exploiting Coroutines to Attack the "Killer Nanoseconds". Proc. VLDB Endow., 11(11):1702–1714, July 2018.

[27]

S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G. Wei, and D. Brooks. Profiling a Warehouse-Scale Computer. ISCA, pages 158–169, 2015.

[28]

M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast Queries over Heterogeneous Data Through Engine Customization. Proc. VLDB Endow., 9(12):972–983, Aug. 2016.

[29]

A. Kemper and T. Neumann. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. ICDE, pages 195–206, 2011.

[30]

T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz. Everything You Always Wanted to Know About Compiled and Vectorized Queries but Were Afraid to Ask. Proc. VLDB Endow., 11(13):2209–2222, Sept. 2018.

[31]

T. Lahiri, S. Chavan, M. Colgan, D. Das, A. Ganesh, M. Gleeson, S. Hase, A. Holloway, J. Kamp, T. Lee, J. Loaiza, N. Macnaughton, V. Marwah, N. Mukherjee, A. Mullick, S. Muthulingam, V. Raja, M. Roth, E. Soylemez, and M. Zait. Oracle Database In-Memory: A Dual Format In-memory Database. ICDE, pages 1253–1258, 2015.

[32]

P.-A. Larson, C. Clinciu, E. N. Hanson, A. Oks, S. L. Price, S. Rangarajan, A. Surna, and Q. Zhou. SQL Server Column Store Indexes. SIGMOD, pages 1177–1184, 2011.

[33]

S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing Main-Memory Join on Modern Hardware. IEEE Trans. Knowl. Data Eng., 14(4):709–730, 2002.

[34]

G. Psaropoulos, T. Legler, N. May, and A. Ailamaki. Interleaving with Coroutines: A Practical Approach for Robust Index Joins. PVLDB, 11(2):230–242, 2017.

[35]

G. Psaropoulos, T. Legler, N. May, and A. Ailamaki. Interleaving with Coroutines: A Systematic and Practical Approach to Hide Memory Latency in Index Joins. The VLDB Journal, Dec 2018.

[36]

G. Psaropoulos, I. Oukid, T. Legler, N. May, and A. Ailamaki. Bridging the Latency Gap between NVM and DRAM for Latency-bound Operations. pages 13:1–13:8, 2019.

[37]

V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, and L. Zhang. DB2 with BLU Acceleration: So Much More Than Just a Column Store. Proc. VLDB Endow., 6(11):1080–1091, Aug. 2013.

[38]

U. Sirin, P. Tözün, D. Porobic, and A. Ailamaki. Micro-architectural Analysis of In-memory OLTP. SIGMOD, pages 387–402, 2016.

[39]

U. Sirin, A. Yasin, and A. Ailamaki. A Methodology for OLTP Micro-architectural Analysis. DaMoN, pages 1:1–1:10, 2017.

[40]

J. Sompolski, M. Zukowski, and P. A. Boncz. Vectorization vs. Compilation in Query Execution. DaMoN, pages 33–40, 2011.

[41]

S. Sridharan and J. M. Patel. Profiling R on a Contemporary Processor. Proc. VLDB Endow., 8(2):173–184, Oct. 2014.

[42]

P. Tözün, B. Gold, and A. Ailamaki. OLTP in Wonderland: Where Do Cache Misses Come from in Major OLTP Components? DaMoN, page 8, 2013.

[43]

P. Tözün, I. Pandis, C. Kaynak, D. Jevdjic, and A. Ailamaki. From A to E: Analyzing TPC’s OLTP Benchmarks: The Obsolete, The Ubiquitous, The Unexplored. EDBT, pages 17–28, 2013.

[44]

TPC. Transaction Processing Performance Council. http://www.tpc.org/

[45]

A. Yasin. A Top-Down Method for Performance Analysis and Counters Architecture. ISPASS, pages 35–44, 2014.

[46]

A. Yasin, Y. Ben-Asher, and A. Mendelson. Deep-dive Analysis of The Data Analytics Workload in CloudSuite. IISWC, pages 202–211, 2014.

[1] [1]

Micro-architectural Analysis of OLAP: Limitations and Opportunities

INTRODUCTION Online analytical processing (OLAP) is an ever-growing, multi-billion dollar industry. Many industrial and commu- nity organizations rely on fast and eﬃcient analytical pro- cessing to extract valuable information from their data. Un- derstanding micro-architectural behavior of OLAP systems, on the other hand, is profound in providing high pe...

work page internal anchor Pith review Pith/arXiv arXiv 1908

[2] [2]

Benchmarks: We use micro-benchmarks and a subset of TPC-H queries [31]

SETUP & METHODOLOGY This section presents our experimental setup and method- ology. Benchmarks: We use micro-benchmarks and a subset of TPC-H queries [31]. We use projection, selection and join micro-benchmarks as they constitute the basic SQL opera- tors. All the systems use hash join algorithm when running the join micro-benchmark. We also performed a g...

work page 2018

[3] [3]

VTune’s general- exploration provide full CPU cycles breakdown [26, 32]

analysis type for CPU cycles breakdown. VTune’s general- exploration provide full CPU cycles breakdown [26, 32]. We examine CPU cycles at two-levels. We ﬁrstly break down the CPU cycles into Retiring and Stall cycles. Retiring cy- cles represent the percentage of the useful cycles spent on retiring instructions. Stall cycles represent the percentage of th...

work page

[4] [4]

Our goal is to observe how the micro-architectural behavior changes as the projec- tivity increases

PROJECTION This section presents the micro-architectural analysis of the projection micro-benchmark. Our goal is to observe how the micro-architectural behavior changes as the projec- tivity increases. Figure 1 shows the CPU cycles breakdown for DBMS R and C. The ﬁgure shows that while DBMS R spends about half of the CPU cycles for Retiring, DBMS C spends...

work page

[5] [5]

Our goal is to examine how inﬂuential the branch mispredictions stalls are on the micro-architectural behavior

SELECTION Having examined the projection, we now move to examin- ing the selection micro-benchmark. Our goal is to examine how inﬂuential the branch mispredictions stalls are on the micro-architectural behavior. Figure 7 shows the CPU cy- cles breakdown for DBMS R and C. We observe that the Re- tiring cycles ratio increases as the selectivity increases bo...

work page

[6] [6]

We force all the systems to use hash join algorithm

JOIN In this section, we examine the join micro-benchmark. We force all the systems to use hash join algorithm. Unlike the selection and projection micro-benchmarks with a sequential data access pattern, hash join includes many random data accesses. Our goal is to understand the eﬀect of random data accesses in the overall micro-architectural behavior. Fi...

work page 2000

[7] [7]

TPC-H Up to now, we have examined simple micro-benchmarks. In this section, we analyze four TPC-H queries: Q1, Q6 and Q9, Q18, each of which represents a particular class of queres: (i) Q1 is a low-cardinality group by (4 groups), (ii) Q6 is a highly selective ﬁlter, (iii) Q9 is a join-intensive query and (iv) Q18 is a high-cardinality group by (1.5 mil- ...

work page

[8] [8]

( % ( (

PREDICA TION In this section, we examine the predication optimization. Predication is used to eliminate branches. Its idea is to con- vert control dependencies to data dependencies by comput- ing the predicate as an arithmetic expression, and using it to increment the index/aggregation. The trade-oﬀ is doing more computation but avoid branches. Our goal i...

SIMD

The second optimization we examine is SIMD. SIMD instructions reduce the number of instructions required to perform arithmetic operations. We test Tectorwise on the projection, selection, and join micro-benchmarks with and without SIMD instructions. As our Broadwell server does not support AVX-512 instructions,...

PREFETCHERS

Sections 3 and 7 have shown that the projection and predicated selection queries suffer from Dcache stalls. Both queries are essentially sequential scans of the relevant columns with a highly predictable data access pattern. Despite that, the large Dcache stalls raise the question of how useful hardware prefe...

MULTI-CORE EXECUTION

Lastly, we examine the hardware utilization for multi-core execution. Most OLAP operations scale well across multiple cores. As a result, we do not expect a big difference between the micro-architectural behavior of multi-core execution and that of single-core execution. We use the four TPC-H queries as they are more complex than the...

RELATED WORK

There is a large body of work on the micro-architectural analysis of database workloads. Ailamaki et al. [2] and Hardavellas et al. [7] present database workload characterization for both analytical and transactional workloads. Tözün et al. [29, 30] present a micro-architectural analysis of disk-based OLTP systems. Sirin et al. [25] present...

CONCLUSIONS

In this work, we evaluate the micro-architectural behavior of a broad range of OLAP systems spanning different system categories and execution models. We examine CPU cycles and memory bandwidth utilization. The results show that, unlike traditional, commercial OLTP systems, traditional, commercial OLAP systems do not suffer from instruction cache...

REFERENCES

[1] D. Abadi, P. Boncz, and S. Harizopoulos. The Design and Implementation of Modern Column-Oriented Database Systems. Now Publishers Inc., 2013.

[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood. DBMSs on a Modern Processor: Where Does Time Go? VLDB, pages 266–277, 1999.

[3] A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade. Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server. BDCloud, pages 1–8, 2015.

[4] A. J. Awan, M. Brorsson, V. Vlassov, and E. Ayguade. Micro-Architectural Characterization of Apache Spark on Batch and Stream Processing Workloads. BDCloud, pages 59–66, 2016.

[5] P. Boncz, T. Grust, M. van Keulen, S. Manegold, J. Rittinger, and J. Teubner. MonetDB/XQuery: A Fast XQuery Processor Powered by a Relational Engine. SIGMOD, pages 479–490, 2006.

[6] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi. Clearing the Clouds: A Study of Emerging Scale-out Workloads on Modern Hardware. ASPLOS, pages 37–48, 2012.

[7] N. Hardavellas, I. Pandis, R. Johnson, N. Mancheril, A. Ailamaki, and B. Falsafi. Database Servers on Chip Multiprocessors: Limitations and Opportunities. CIDR, pages 79–87, 2007.

[8] S. Idreos, F. Groffen, N. Nes, S. Manegold, K. S. Mullender, and M. L. Kersten. MonetDB: Two Decades of Research in Column-oriented Database Architectures. IEEE Data Engineering Bulletin, 35(1):40–45, 2012.

[9] Intel. Disclosure of Hardware Prefetcher Control on Some Intel Processors. https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

[10] Intel. Intel Memory Latency Checker. https://software.intel.com/en-us/articles/intelr-memory-latency-checker

[11] Intel. Understanding How General Exploration Works in Intel VTune Amplifier, 2018. https://software.intel.com/en-us/articles/understanding-how-general-exploration-works-in-intel-vtune-amplifier-xe

[12] Intel. Intel(R) 64 and IA-32 Architectures Optimization Reference Manual, 2019.

[13] C. Jonathan, U. F. Minhas, J. Hunter, J. Levandoski, and G. Nishanov. Exploiting Coroutines to Attack the "Killer Nanoseconds". Proc. VLDB Endow., 11(11):1702–1714, July 2018.

[14] S. Kanev, J. P. Darago, K. Hazelwood, P. Ranganathan, T. Moseley, G. Wei, and D. Brooks. Profiling a Warehouse-Scale Computer. ISCA, pages 158–169, 2015.

[15] M. Karpathiotakis, I. Alagiannis, and A. Ailamaki. Fast Queries over Heterogeneous Data Through Engine Customization. Proc. VLDB Endow., 9(12):972–983, Aug. 2016.

[16] A. Kemper and T. Neumann. HyPer: A Hybrid OLTP&OLAP Main Memory Database System Based on Virtual Memory Snapshots. ICDE, pages 195–206, 2011.

[17] T. Kersten, V. Leis, A. Kemper, T. Neumann, A. Pavlo, and P. Boncz. Everything You Always Wanted to Know About Compiled and Vectorized Queries but Were Afraid to Ask. Proc. VLDB Endow., 11(13):2209–2222, Sept. 2018.

[18] T. Lahiri, S. Chavan, M. Colgan, D. Das, A. Ganesh, M. Gleeson, S. Hase, A. Holloway, J. Kamp, T. Lee, J. Loaiza, N. Macnaughton, V. Marwah, N. Mukherjee, A. Mullick, S. Muthulingam, V. Raja, M. Roth, E. Soylemez, and M. Zait. Oracle Database In-Memory: A Dual Format In-memory Database. ICDE, pages 1253–1258, 2015.

[19] P.-A. Larson, C. Clinciu, E. N. Hanson, A. Oks, S. L. Price, S. Rangarajan, A. Surna, and Q. Zhou. SQL Server Column Store Indexes. SIGMOD, pages 1177–1184, 2011.

[20] S. Manegold, P. A. Boncz, and M. L. Kersten. Optimizing Main-Memory Join on Modern Hardware. IEEE Trans. Knowl. Data Eng., 14(4):709–730, 2002.

[21] G. Psaropoulos, T. Legler, N. May, and A. Ailamaki. Interleaving with Coroutines: A Practical Approach for Robust Index Joins. PVLDB, 11(2):230–242, 2017.

[22] G. Psaropoulos, T. Legler, N. May, and A. Ailamaki. Interleaving with Coroutines: A Systematic and Practical Approach to Hide Memory Latency in Index Joins. The VLDB Journal, Dec. 2018.

[23] G. Psaropoulos, I. Oukid, T. Legler, N. May, and A. Ailamaki. Bridging the Latency Gap between NVM and DRAM for Latency-bound Operations. Pages 13:1–13:8, 2019.

[24] V. Raman, G. Attaluri, R. Barber, N. Chainani, D. Kalmuk, V. KulandaiSamy, J. Leenstra, S. Lightstone, S. Liu, G. M. Lohman, T. Malkemus, R. Mueller, I. Pandis, B. Schiefer, D. Sharpe, R. Sidle, A. Storm, and L. Zhang. DB2 with BLU Acceleration: So Much More Than Just a Column Store. Proc. VLDB Endow., 6(11):1080–1091, Aug. 2013.

[25] U. Sirin, P. Tözün, D. Porobic, and A. Ailamaki. Micro-architectural Analysis of In-memory OLTP. SIGMOD, pages 387–402, 2016.

[26] U. Sirin, A. Yasin, and A. Ailamaki. A Methodology for OLTP Micro-architectural Analysis. DaMoN, pages 1:1–1:10, 2017.

[27] J. Sompolski, M. Zukowski, and P. A. Boncz. Vectorization vs. Compilation in Query Execution. DaMoN, pages 33–40, 2011.

[28] S. Sridharan and J. M. Patel. Profiling R on a Contemporary Processor. Proc. VLDB Endow., 8(2):173–184, Oct. 2014.

[29] P. Tözün, B. Gold, and A. Ailamaki. OLTP in Wonderland: Where Do Cache Misses Come From in Major OLTP Components? DaMoN, page 8, 2013.

[30] P. Tözün, I. Pandis, C. Kaynak, D. Jevdjic, and A. Ailamaki. From A to E: Analyzing TPC’s OLTP Benchmarks: The Obsolete, The Ubiquitous, The Unexplored. EDBT, pages 17–28, 2013.

[31] TPC. Transaction Processing Performance Council. http://www.tpc.org/

[32] A. Yasin. A Top-Down Method for Performance Analysis and Counters Architecture. ISPASS, pages 35–44, 2014.

[33] A. Yasin, Y. Ben-Asher, and A. Mendelson. Deep-dive Analysis of The Data Analytics Workload in CloudSuite. IISWC, pages 202–211, 2014.