arxiv: 2406.06886 · v2 · submitted 2024-06-11 · 💻 cs.DB

Enabling Data Dependency-based Query Optimization

Pith reviewed 2026-05-24 00:24 UTC · model grok-4.3

classification 💻 cs.DB
keywords data dependencies · query optimization · database systems · TPC-DS · JOB benchmark · primary keys · foreign keys · performance improvement

The pith

An automated system discovers and validates additional data dependencies to optimize queries without manual declarations or SQL rewrites.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that data dependencies beyond primary and foreign keys can be identified automatically and used to improve query performance in analytical databases. It first establishes the potential gains through experiments with rewritten SQL queries across multiple systems and benchmarks. It then describes a complete system that finds candidate dependencies, checks them efficiently, and feeds valid ones into the optimizer. If correct, this removes the need for experts to declare or maintain such dependencies while delivering speedups comparable to hand-tuned rewrites.

Core claim

The paper claims that an integrated system can recognize dependency candidates, validate them for optimization use, and apply them in query plans, matching the performance of dedicated SQL rewrites. Compared to PKs and FKs alone, it reports geometric mean speedups of 35% on TPC-DS and 29% on JOB, with some queries improving more than 90%, and discovery costs far below the gains from one workload run.

What carries the argument

The automated pipeline that recognizes dependency candidates, validates their applicability to queries, and integrates them into existing query optimizers without manual input.
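The validation step of such a pipeline can be illustrated with a minimal sketch. It checks unique column combination (UCC) candidates against an in-memory table and keeps only the ones that hold. The table layout, the function names `is_unique` and `validate_candidates`, and the sample data are assumptions made for illustration; the paper's system validates candidates inside the database engine and may work quite differently.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical in-memory table: column name -> list of column values.
Table = Dict[str, List[object]]


def is_unique(table: Table, columns: Sequence[str]) -> bool:
    """Return True if no two rows share the same values in `columns`."""
    seen = set()
    for row in zip(*(table[c] for c in columns)):
        if row in seen:
            return False  # duplicate found, so the candidate is invalid
        seen.add(row)
    return True


def validate_candidates(
    table: Table, candidates: Sequence[Sequence[str]]
) -> List[Tuple[str, ...]]:
    """Keep only the UCC candidates that actually hold on the data."""
    return [tuple(c) for c in candidates if is_unique(table, c)]


# Illustrative data; candidates would normally come from the query plans.
table = {
    "id": [1, 2, 3],
    "country": ["DE", "DE", "US"],
    "city": ["Berlin", "Potsdam", "Berlin"],
}
candidates = [("id",), ("country",), ("country", "city")]
valid = validate_candidates(table, candidates)
```

Only the validated combinations would be handed to the optimizer; here `("country",)` is rejected because "DE" occurs twice.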

If this is right

  • Queries achieve geometric mean speedups of 35% on TPC-DS and 29% on JOB over PK/FK-only optimization.
  • Individual query latencies can drop by more than 90% when valid dependencies are applied.
  • Dependency discovery overhead remains orders of magnitude smaller than the improvement from executing a workload once.
  • The gains appear across a range of analytical database systems when dependencies are used without SQL rewrites.
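To make the reported metrics concrete, the following sketch computes a geometric mean speedup and a per-query latency reduction. It assumes that speedup is the ratio of baseline latency to optimized latency, which is one common convention and may not be the exact definition used in the paper. The latencies are invented for illustration and are not measurements from the paper.

```python
import math
from typing import Sequence


def geometric_mean_speedup(
    baseline_ms: Sequence[float], optimized_ms: Sequence[float]
) -> float:
    """Geometric mean of the per-query ratios baseline / optimized."""
    ratios = [b / o for b, o in zip(baseline_ms, optimized_ms)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def latency_reduction(baseline_ms: float, optimized_ms: float) -> float:
    """Fraction of the baseline latency that the optimization removed."""
    return 1.0 - optimized_ms / baseline_ms


# Invented latencies for three queries (milliseconds).
baseline = [100.0, 200.0, 400.0]
optimized = [50.0, 200.0, 100.0]

# Ratios are 2, 1, and 4; their geometric mean is the cube root of 8.
gm = geometric_mean_speedup(baseline, optimized)

# A query that drops from 100 ms to 10 ms has a 90% latency reduction.
drop = latency_reduction(100.0, 10.0)
```

The geometric mean is the usual choice for averaging ratios because a single very large per-query speedup does not dominate it the way it would dominate an arithmetic mean.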

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because discovery costs less than the gain from a single workload run, the overhead amortizes further when queries run repeatedly on the same data.
  • Because no manual declaration is needed, the technique could extend to environments where schema changes frequently.
  • If validation scales with data size, similar automation might apply to larger analytical workloads beyond the tested benchmarks.

Load-bearing premise

Target datasets contain additional data dependencies that can be found and checked efficiently enough for the performance gains to outweigh the discovery cost.

What would settle it

Running the system on datasets that lack extra dependencies beyond PKs and FKs, or where validation time exceeds the latency savings on a workload, would show no net benefit.
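The test described above amounts to a break-even calculation. The sketch below estimates how many workload runs are needed before the latency savings cover a one-time discovery cost. The function name `runs_to_break_even` and all numbers are hypothetical; the sketch ignores effects such as re-validation after data changes.

```python
import math
from typing import Optional


def runs_to_break_even(
    baseline_ms: float, optimized_ms: float, discovery_ms: float
) -> Optional[int]:
    """Workload runs needed until savings cover the discovery cost.

    Returns None when the optimized workload is not faster, in which
    case discovery never pays off.
    """
    saving_per_run = baseline_ms - optimized_ms
    if saving_per_run <= 0:
        return None
    return max(1, math.ceil(discovery_ms / saving_per_run))


# Invented numbers: discovery is tiny compared to the per-run saving.
cheap = runs_to_break_even(1000.0, 700.0, 3.0)

# Invented numbers: small saving, so several runs are needed.
slow = runs_to_break_even(1000.0, 990.0, 25.0)

# No extra dependencies apply, so there is no saving at all.
never = runs_to_break_even(1000.0, 1000.0, 5.0)
```

The paper's claim corresponds to the first case, where a single run already recovers the discovery cost; the settling experiment looks for datasets that fall into the third case.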

Figures

Figures reproduced from arXiv: 2406.06886 by Daniel Lindner, Daniel Ritter, Felix Naumann.

Figure 2: Original query plan and versions successively rewritten using O-1, O-2, and O-3. Edges are annotated with the data …
Figure 3: Architectural overview of the automatic depen…
Figure 4: Example of dependency propagation in the query …
Figure 5: Metadata-aware UCC validation using the on-the …
Figure 7: Latencies with and without dependency-based op…
Figure 10: Average candidate validation times for four bench…

Abstract

Primary key (PK) and foreign key (FK) constraints are widely used for query optimization. Knowledge about additional data dependencies, such as order dependencies, enables further substantial performance improvements. However, such dependencies are not maintained by database systems or are even unknown to the user. Identifying and validating relevant dependencies automatically and efficiently remains an unsolved problem. This paper presents a system that (i) recognizes dependency candidates for optimization, (ii) efficiently validates their applicability, and (iii) optimizes query plans using valid dependencies. First, we demonstrate the performance impact of optimization techniques using data dependencies additional to PKs and FKs. Using rewritten SQL queries, we empirically show that data dependencies improve performance for a wide range of analytical database systems and benchmarks. Second, we present how to integrate data dependencies into a system to use them without (i) manual declaration and maintenance or (ii) SQL rewrites. Our integrated and fully automated system matches the performance of dedicated SQL rewrites: compared to using only PKs and FKs, queries improve with geometric mean speedups of 35 % for TPC-DS and 29 % for JOB. Individual query latencies drop by more than 90 %. The dependency discovery overhead is orders of magnitude lower than the latency improvement of a single workload execution.
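The abstract names order dependencies as one kind of additional dependency. As a minimal sketch, an order dependency from column A to column B can be read as: any ordering of the rows by A is also an ordering by B. The check below implements that reading on two plain Python lists. The function name and the sample columns (a surrogate date key and a year) are assumptions for illustration and are not taken from the paper's implementation.

```python
from typing import Sequence


def holds_order_dependency(a_values: Sequence, b_values: Sequence) -> bool:
    """Check whether sorting rows by A always yields rows sorted by B."""
    pairs = sorted(zip(a_values, b_values))
    for (a1, b1), (a2, b2) in zip(pairs, pairs[1:]):
        # Rows that tie on A may appear in any order, so they must agree on B.
        if a1 == a2 and b1 != b2:
            return False
        # B must never decrease as A increases.
        if b2 < b1:
            return False
    return True


# Hypothetical date dimension: a growing surrogate key orders the year.
date_key = [1, 2, 3, 4]
year = [2000, 2000, 2001, 2001]
od_key_year = holds_order_dependency(date_key, year)

# The reverse direction fails: equal years map to different keys.
od_year_key = holds_order_dependency(year, date_key)
```

When such a dependency holds, a range predicate on B can in principle be translated into a range on A; whether and how the paper's optimizer does this is not stated in the abstract.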

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a system for automatically recognizing, validating, and integrating data dependencies (beyond PK/FK constraints, including order dependencies) into query optimizers for analytical workloads. It first empirically demonstrates performance gains from such dependencies via hand-written SQL rewrites across database systems and benchmarks, then claims an integrated automated pipeline that matches those gains without manual declaration or rewrites, reporting geometric-mean speedups of 35% on TPC-DS and 29% on JOB (with some queries improving >90%) and discovery overhead orders of magnitude below query latency savings.

Significance. If the automated recognition+validation+integration pipeline is shown to surface the same dependencies and produce equivalent plan changes as the manual rewrites, the result would be significant: it would make dependency-based optimizations practical at scale without user intervention. The low-overhead claim and cross-system empirical gains (if reproducible) would strengthen the case for extending optimizers beyond PK/FK.

major comments (2)
  1. [Abstract] The central claim that the 'integrated and fully automated system matches the performance of dedicated SQL rewrites' is load-bearing yet unsupported; no evidence is supplied that the discovery pipeline surfaces exactly the dependencies exploited by the rewrites or that the optimizer integration reproduces the same plan deltas.
  2. [Abstract / experimental evaluation] The reported geometric-mean speedups (35% TPC-DS, 29% JOB) and individual >90% latency drops are presented without any description of experimental controls, statistical significance testing, workload selection criteria, or safeguards against post-hoc dependency selection, limiting verification of the performance claims.
minor comments (1)
  1. [Abstract] The abstract mentions 'order dependencies' as an example but does not enumerate the full set of dependency types handled by the system; a brief enumeration would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will revise the manuscript where the concerns identify opportunities for clarification or additional evidence.

  1. Referee: [Abstract] The central claim that the 'integrated and fully automated system matches the performance of dedicated SQL rewrites' is load-bearing yet unsupported; no evidence is supplied that the discovery pipeline surfaces exactly the dependencies exploited by the rewrites or that the optimizer integration reproduces the same plan deltas.

    Authors: The manuscript reports that the automated pipeline produces the same geometric-mean speedups as the hand-written rewrites (35% on TPC-DS, 29% on JOB). We agree, however, that an explicit side-by-side comparison of discovered dependencies and resulting plan deltas would make the equivalence claim more direct. We will add such a comparison (e.g., a table listing dependencies used in the manual rewrites versus those surfaced by the pipeline, together with optimizer plan differences) to the revised evaluation section. revision: yes

  2. Referee: [Abstract / experimental evaluation] The reported geometric-mean speedups (35% TPC-DS, 29% JOB) and individual >90% latency drops are presented without any description of experimental controls, statistical significance testing, workload selection criteria, or safeguards against post-hoc dependency selection, limiting verification of the performance claims.

    Authors: The abstract is intentionally concise; the full experimental section describes the TPC-DS and JOB workloads, query selection, and the automated discovery/validation pipeline. We will nevertheless revise the abstract to include a short statement of the benchmarks used and a pointer to the detailed methodology. We will also add any missing statistical significance results and an explicit description of how the candidate-generation step avoids post-hoc selection (candidates are enumerated from schema and data statistics independently of the query workload). revision: partial

Circularity Check

0 steps flagged

No circularity: empirical system benchmarks with direct measurements

The paper describes a practical system for auto-discovering, validating, and integrating data dependencies into query optimizers, evaluated via direct runtime benchmarks on TPC-DS and JOB workloads. Speedups (geometric means 35% and 29%) and latency reductions are reported as measured outcomes from the implemented pipeline, not as quantities derived from equations, fitted parameters, or self-referential definitions. No load-bearing derivations, uniqueness theorems, or ansatzes appear; the central claims rest on experimental comparison to PK/FK baselines and hand-written rewrites rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard database assumptions that query optimizers can exploit additional dependencies when present and that such dependencies occur in real analytical workloads; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • Domain assumption: Database query optimizers can exploit data dependencies beyond primary and foreign keys when they are known and valid.
    Invoked in the opening sentences of the abstract as the basis for performance improvements.
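One way to see what this assumption means in practice is a small rewrite that is only safe when a dependency holds. The sketch below uses Python's built-in `sqlite3` module and an invented two-table schema with no declared constraints. If `c_id` is unique in `customer`, then `c_id` determines `c_name`, so grouping by `c_id` alone gives the same result as grouping by both columns. This is a generic, well-known rewrite shown for illustration; it is not claimed to be the exact technique or implementation used in the paper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Invented schema and data; no PK, FK, or UNIQUE constraints are declared.
conn.executescript(
    """
    CREATE TABLE customer (c_id INTEGER, c_name TEXT);
    CREATE TABLE orders (o_id INTEGER, o_c_id INTEGER, o_total REAL);
    INSERT INTO customer VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.0), (12, 2, 1.0);
    """
)

# Original query: groups by both the key and the dependent column.
original = conn.execute(
    """
    SELECT c_id, c_name, SUM(o_total)
    FROM customer JOIN orders ON c_id = o_c_id
    GROUP BY c_id, c_name
    ORDER BY c_id
    """
).fetchall()

# Rewritten query: valid only because c_id is unique in customer, so
# every group contains exactly one distinct c_name.
rewritten = conn.execute(
    """
    SELECT c_id, MIN(c_name), SUM(o_total)
    FROM customer JOIN orders ON c_id = o_c_id
    GROUP BY c_id
    ORDER BY c_id
    """
).fetchall()

conn.close()
```

Both queries return the same rows on this data. If `c_id` were not unique, the rewritten query could merge groups that the original keeps apart, which is why the dependency has to be validated before an optimizer may rely on it.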
