Architectural Evolution and Selection Framework for Database Systems in AI-Ready Data Platforms

Mohit Srivastava

arxiv: 2606.08317 · v1 · pith:NPHHHIJMnew · submitted 2026-06-06 · 💻 cs.DB

Architectural Evolution and Selection Framework for Database Systems in AI-Ready Data Platforms

Mohit Srivastava This is my paper

Pith reviewed 2026-06-27 18:39 UTC · model grok-4.3

classification 💻 cs.DB

keywords database architecture selectionpolyglot persistenceAI-ready platformshybrid architecturesworkload characterizationcross-paradigm analysislakehouse storagefinancial fraud detection

0 comments

The pith

A framework of nine dimensions and multi-stage selection identifies hybrid polyglot architectures as optimal for AI-ready data platforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a unified framework to evaluate and select database architectures for AI-ready platforms that must handle varied workloads. It rests on nine architectural dimensions and a process that starts with workload characterization, applies constraint filtering, and ends with compatibility scoring to compare options systematically. Analysis of thirteen database paradigms uncovers three recurring evolutionary patterns: decoupling of storage and compute, workload-driven specialization, and convergence toward integrated AI platforms. A case study on financial fraud detection applies the framework to show that hybrid polyglot architectures perform best when requirements span transactional, analytical, and AI-oriented needs. The work concludes with an AI-ready reference architecture that layers lakehouse storage, feature processing, and semantic retrieval as the base for analytics, machine learning, and retrieval-augmented generation.

Core claim

The paper claims that its cross-paradigm framework, grounded in nine dimensions and a multi-stage process of workload characterization, constraint filtering, and compatibility scoring, enables systematic evaluation of database architectures. When used to compare thirteen paradigms, it uncovers patterns of decoupling storage and compute, workload-driven specialization, and convergence on AI-ready platforms. The framework's application to an enterprise fraud detection scenario shows hybrid polyglot architectures as the preferred solution for multidimensional workloads, leading to a proposed reference architecture integrating lakehouse, feature, and semantic layers.

What carries the argument

Nine architectural dimensions combined with the multi-stage selection process of workload characterization, constraint filtering, and compatibility scoring.

If this is right

Database selection shifts from ad-hoc intuition to a repeatable, cross-paradigm comparison.
Hybrid polyglot architectures emerge as the solution for workloads with mixed transactional, analytical, and AI requirements.
Three evolutionary patterns guide future platform design: storage-compute decoupling, specialization, and integration of AI capabilities.
A reference architecture using lakehouse storage plus feature and semantic layers supports modern analytics and generation tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same nine dimensions could be tested on workloads outside finance, such as healthcare or e-commerce, to check broader applicability.
If new database paradigms appear that resist classification along the existing dimensions, the framework would require explicit extension.
Organizations could embed the multi-stage process into procurement tools to quantify trade-offs before migration decisions.
The observed convergence toward AI-ready platforms implies that future single-paradigm systems will need native support for feature stores and vector search to remain competitive.

Load-bearing premise

The nine architectural dimensions and the multi-stage selection process are both comprehensive enough and sufficient to capture all decision-relevant factors across transactional, analytical, and AI-oriented database systems.

What would settle it

A deployment of the framework on a multidimensional workload in which the recommended hybrid polyglot architecture underperforms a single-paradigm system on cost, latency, or accuracy metrics.

Figures

Figures reproduced from arXiv: 2606.08317 by Mohit Srivastava.

**Figure 5.** Figure 5: Reference RAG architecture separating offline indexing (chunking, embedding, vector index build) from online inferenc [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

**Figure 6.** Figure 6: Conceptual framework for database architecture selection based on workload profiling, architectural constraint filter [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

read the original abstract

The rise of polyglot data management and AI-ready database architectures has created a complex design space across diverse database paradigms. However, architecture selection in modern enterprise environments continues to rely heavily on ad-hoc engineering intuition, with limited systematic frameworks to guide decision-making across heterogeneous database systems. This paper introduces a unified cross-paradigm evaluation and selection framework for database architecture design in AI-ready data platforms. The framework is based on nine architectural dimensions and incorporates a structured multi-stage selection process involving workload characterization, constraint filtering, and compatibility scoring to enable systematic comparison and decision-making. To ground the framework, we conduct a structured comparative analysis across thirteen major database paradigms spanning transactional, analytical, and AI-oriented systems. This analysis reveals three recurring patterns in database evolution: decoupling of storage and compute, workload-driven specialization, and convergence toward integrated AI-ready platforms. The proposed framework is demonstrated through a representative enterprise case study in financial fraud detection, illustrating how hybrid, polyglot architectures emerge as optimal solutions for multidimensional workload requirements. The cross-paradigm analysis culminates in an AI-ready reference architecture that integrates lakehouse storage, feature processing, and semantic retrieval layers as the unified substrate for modern analytics, machine learning, and Retrieval-Augmented Generation applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a nine-dimension checklist for database choices in AI platforms and flags hybrid setups as optimal, but the supporting analysis is descriptive with no visible validation or benchmarks.

read the letter

The core contribution is a framework that breaks database architecture decisions into nine dimensions and runs them through workload characterization, constraint filtering, and compatibility scoring. It then maps this across thirteen paradigms and uses a financial fraud case to conclude that polyglot hybrids win for mixed transactional, analytical, and AI workloads.

What works is the organization. Pulling together decoupling of storage and compute, workload specialization, and convergence toward lakehouse-plus-AI layers into one reference architecture gives practitioners a single place to line up options. The patterns it names match what many teams already discuss, so the value is mainly in making those patterns explicit rather than in discovering them.

The soft spots sit in the evidence. The abstract and description supply no derivation for how the nine dimensions were selected, no ablation showing what happens if one is removed, no inter-rater scores on the compatibility step, and no head-to-head comparison against existing selection methods or expert intuition. A single case study without quantitative outcomes or sensitivity checks leaves the optimality claim resting on the untested assumption that the chosen dimensions cover every relevant factor.

This is aimed at enterprise architects and data-platform engineers who need a structured way to compare options rather than at researchers seeking new theorems or reproducible experiments. A reader already deep in the space will treat it as a useful checklist; someone expecting formal grounding or external benchmarks will see the gap.

The work shows clear thinking in how it assembles the pieces, but the central claims about systematic superiority stay unsupported by the visible data. I would not send it for peer review in its current form unless the full manuscript adds quantitative validation or reproducible scoring artifacts.

Referee Report

3 major / 1 minor

Summary. The paper introduces a unified cross-paradigm evaluation and selection framework for database architecture design in AI-ready data platforms. The framework is based on nine architectural dimensions and a multi-stage selection process (workload characterization, constraint filtering, compatibility scoring). It performs a comparative analysis across thirteen database paradigms spanning transactional, analytical, and AI-oriented systems, identifies three recurring patterns in database evolution (decoupling of storage and compute, workload-driven specialization, and convergence toward integrated AI-ready platforms), and demonstrates the framework through a financial fraud detection case study, concluding that hybrid polyglot architectures emerge as optimal and proposing an AI-ready reference architecture integrating lakehouse storage, feature processing, and semantic retrieval layers.

Significance. If the nine dimensions and multi-stage process can be shown to be comprehensive and reproducible, the work would supply a valuable systematic alternative to ad-hoc intuition for selecting database architectures in polyglot, AI-integrated environments. The cross-paradigm analysis and reference architecture could usefully inform design decisions for multidimensional workloads. The absence of quantitative validation, benchmarking, or sensitivity analysis, however, limits the strength of these contributions.

major comments (3)

[Abstract] Abstract: the central claim that the framework enables systematic comparison and reveals hybrid polyglot architectures as optimal rests on application to thirteen paradigms and one case study, yet the abstract (and by extension the manuscript) supplies no quantitative scores, exclusion criteria, error analysis, or comparison against existing selection methods, leaving the optimality conclusion unsupported by visible evidence.
[Framework description] Framework section (likely §3): the nine architectural dimensions are asserted to be both necessary and sufficient for cross-paradigm comparison, but the manuscript provides no derivation of how the dimensions were selected, no ablation (e.g., effect of removing one dimension on the ranking), and no inter-expert reproducibility test of the compatibility scoring step; this directly undermines the reliability of the multi-stage process.
[Case study] Case study section: the fraud-detection example concludes that hybrid architectures are optimal, but without sensitivity analysis on the nine dimensions or external benchmarking, the result remains dependent on the untested premise that the chosen dimensions capture every decision-relevant factor.

minor comments (1)

[Abstract] Abstract: the three reported patterns are presented as independently derived, but it is unclear whether they are restatements of the input dimensions and case-study outcomes rather than externally validated observations.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the insightful comments, which help clarify the scope and limitations of our proposed framework. We address each major comment below.

read point-by-point responses

Referee: [Abstract] the central claim that the framework enables systematic comparison and reveals hybrid polyglot architectures as optimal rests on application to thirteen paradigms and one case study, yet the abstract supplies no quantitative scores, exclusion criteria, error analysis, or comparison against existing selection methods, leaving the optimality conclusion unsupported by visible evidence.

Authors: We agree that the abstract could be improved to avoid overstating the evidence. The manuscript's contribution is the framework itself and its application, which is qualitative. We will revise the abstract to state that the framework supports systematic comparison and that the case study shows hybrid architectures emerging as suitable, rather than claiming they are optimal in a general sense. revision: yes
Referee: [Framework description] the nine architectural dimensions are asserted to be both necessary and sufficient for cross-paradigm comparison, but the manuscript provides no derivation of how the dimensions were selected, no ablation (e.g., effect of removing one dimension on the ranking), and no inter-expert reproducibility test of the compatibility scoring step; this directly undermines the reliability of the multi-stage process.

Authors: The nine dimensions were identified based on a synthesis of database system literature covering transactional, analytical, and AI-oriented paradigms. We will expand the framework section to provide the derivation and sources for the dimensions. We did not include ablation or reproducibility tests, as the work is a proposal for a selection framework rather than an empirical validation study. A new limitations subsection will be added to discuss this. revision: partial
Referee: [Case study] the fraud-detection example concludes that hybrid architectures are optimal, but without sensitivity analysis on the nine dimensions or external benchmarking, the result remains dependent on the untested premise that the chosen dimensions capture every decision-relevant factor.

Authors: We acknowledge this limitation in the case study. We will revise the case study to frame the conclusion as an example of how the framework can be applied to identify hybrid solutions for the given workload, rather than a general optimality claim. We will also add a note on the dependence on the chosen dimensions and suggest sensitivity analysis as future work. revision: partial

standing simulated objections not resolved

The absence of quantitative validation, benchmarking, sensitivity analysis, ablation studies, and inter-expert reproducibility tests, as these would require additional empirical research beyond the current manuscript.

Circularity Check

0 steps flagged

No significant circularity; framework presented as proposal without self-referential reduction

full rationale

The paper introduces nine architectural dimensions and a multi-stage selection process as the basis for a new framework, then applies them to a comparative analysis of thirteen paradigms and one case study to surface three patterns and an optimal hybrid architecture. No equations, fitted parameters, or self-citations are shown reducing the claimed patterns or optimality conclusion back to the input dimensions by construction. The dimensions are posited rather than derived from the outcomes they evaluate, and the analysis is presented as an application rather than a closed loop. This is a standard proposal of a decision framework with no load-bearing self-definition or renaming of known results as new derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the untested premise that exactly nine dimensions suffice and that the three-stage process is optimal; these choices appear introduced for the paper rather than derived from external benchmarks.

axioms (2)

ad hoc to paper Nine architectural dimensions are both necessary and sufficient for cross-paradigm comparison.
Stated in the abstract as the basis of the framework; no justification or prior reference supplied.
ad hoc to paper The multi-stage process (characterization, filtering, scoring) produces reliable optimality rankings.
Presented as the core method without external validation or formal proof.

pith-pipeline@v0.9.1-grok · 5738 in / 1398 out tokens · 19735 ms · 2026-06-27T18:39:41.773908+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

53 extracted references · 32 canonical work pages

[1]

One size fits all: An idea whose time has come and gone,

M. Stonebraker and U. Çetintemel, “One size fits all: An idea whose time has come and gone,” in Proc. IEEE 21st Int. Conf. Data Engineering (ICDE), Tokyo, Japan, 2005, pp. 2–11, doi: 10.1109/ICDE.2005.1

work page doi:10.1109/icde.2005.1 2005
[2]

Architecture of a Database System,

J. M. Hellerstein, M. Stonebraker, and J. Hamilton, “Architecture of a Database System,” Foundations and Trends in Databases, vol. 1, no. 2, pp. 141–259, 2007, doi: 10.1561/1900000002

work page doi:10.1561/1900000002 2007
[3]

Survey of Large-Scale Data Management Systems for Big Data Applications,

L. Wu, L. Yuan, and J. You, “Survey of Large-Scale Data Management Systems for Big Data Applications,” Journal of Computer Science and Technology, vol. 30, no. 1, pp. 163–183, 2015, doi: 10.1007/s11390-015- 1511-8

work page doi:10.1007/s11390-015- 2015
[4]

A Survey on NoSQL Databases,

S. Gajendran, “A Survey on NoSQL Databases,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 2, pp. 192–195, 2013

2013
[5]

Main-Memory Database Systems: An Overview,

F. Faerber, A. Kemper, P.-Å. Larson, J. Levandoski, T. Neumann, A. Pavlo, “Main-Memory Database Systems: An Overview,” Foundations and Trends in Databases, vol. 8, no. 1–2, pp. 1–130, 2017, doi: 10.1561/1900000058

work page doi:10.1561/1900000058 2017
[6]

Time Series Management Systems: A Survey,

S. K. Jensen, T. B. Pedersen, and C. Thomsen, “Time Series Management Systems: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 11, pp. 2581–2600, 2017, doi: 10.1109/TKDE.2017.2740932

work page doi:10.1109/tkde.2017.2740932 2017
[7]

Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries,

M. Besta, R. Gerstenberger, E. Peter, M. Fischer, M. Podstawski, C. Barthels, G. Alonso, T. Hoefler, “Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023, doi: 10.1145/3604932

work page doi:10.1145/3604932 2023
[8]

Cloud-Native Databases: A Survey,

H. Dong, C. Zhang, G. Li, and H. Zhang, “Cloud-Native Databases: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 12, pp. 7772–7791, 2024, doi: 10.1109/TKDE.2024.3397508

work page doi:10.1109/tkde.2024.3397508 2024
[9]

Data Management in the Cloud: Limitations and Opportunities,

D. J. Abadi, “Data Management in the Cloud: Limitations and Opportunities,” IEEE Data Engineering Bulletin, vol. 32, no. 1, pp. 3–12,
[10]

http://sites.computer.org/debull/A09mar/abadi.pdf
[11]

Data Lakes: A Survey of Functions and Systems,

R. Hai, C. Koutras, C. Quix, M. Jarke, “Data Lakes: A Survey of Functions and Systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 12, pp. 12571–12590, 2023, doi: 10.1109/TKDE.2023.3270101

work page doi:10.1109/tkde.2023.3270101 2023
[12]

Highly Available Transactions: Virtues and Limitations,

P. Bailis et al., “Highly Available Transactions: Virtues and Limitations,” Proc. VLDB Endowment, vol. 7, no. 3, pp. 181–192, 2014

2014
[13]

CAP Twelve Years Later: How the Rules Have Changed,

E. Brewer, “CAP Twelve Years Later: How the Rules Have Changed,” Computer, vol. 45, no. 2, pp. 23–29, 2012, doi: 10.1109/MC.2012.37

work page doi:10.1109/mc.2012.37 2012
[14]

Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,

S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,” ACM SIGACT News, vol. 33, no. 2, pp. 51–59, Jun. 2002, doi: 10.1145/564585.564601

work page doi:10.1145/564585.564601 2002
[15]

The End of an Architectural Era (It’s Time for a Complete Rewrite),

M. Stonebraker et al., “The End of an Architectural Era (It’s Time for a Complete Rewrite),” Proc. VLDB, Vienna, Austria, 2007, pp. 1150– 1160

2007
[16]

Codd , title =

E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970, doi: 10.1145/362384.362685

work page doi:10.1145/362384.362685 1970
[17]

The transaction concept: Virtues and limitations,

J. Gray, “The transaction concept: Virtues and limitations,” in Proc. 7th Int. Conf. Very Large Data Bases (VLDB), Cannes, France, 1981, pp. 144–154, doi: 10.5555/1286831.1286846

work page doi:10.5555/1286831.1286846 1981
[18]

Bigtable: A distributed storage system for structured data,

F. Chang et al., “Bigtable: A distributed storage system for structured data,” in Proc. 7th USENIX Symp. Operating Systems Design and Implementation (OSDI), Seattle, WA, USA, 2006, pp. 205–218

2006
[19]

Dynamo: Amazon’s highly available key-value store,

G. DeCandia et al., “Dynamo: Amazon’s highly available key-value store,” in Proc. 21st ACM Symp. Operating Systems Principles (SOSP), Stevenson, WA, USA, 2007, pp. 205–220

2007
[20]

Column-oriented database systems,

D. J. Abadi, P. A. Boncz, and S. Harizopoulos, “Column-oriented database systems,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1664–1665, 2009

2009
[21]

C-Store: A column-oriented DBMS,

M. Stonebraker et al., “C-Store: A column-oriented DBMS,” in Proc. 31st Int. Conf. Very Large Data Bases (VLDB), Trondheim, Norway, 2005, pp. 553–564

2005
[22]

Integrating compression and execution in column-oriented database systems,

D. J. Abadi, S. Madden, and M. Ferreira, “Integrating compression and execution in column-oriented database systems,” in Proc. 2006 ACM SIGMOD Int. Conf. Management of Data, Chicago, IL, USA, 2006, pp. 671–682, doi: 10.1145/1142473.1142548

work page doi:10.1145/1142473.1142548 2006
[23]

Spanner: Google’s globally distributed database,

J. C. Corbett et al., “Spanner: Google’s globally distributed database,” in Proc. 10th USENIX Symp. Operating Systems Design and Implementation (OSDI), Hollywood, CA, USA, 2012, pp. 251–264

2012
[24]

What’s really new with NewSQL?,

A. Pavlo and M. Aslett, “What’s really new with NewSQL?,” ACM SIGMOD Record, vol. 45, no. 2, pp. 45–55, Jun. 2016, doi: 10.1145/3003665.3003674

work page doi:10.1145/3003665.3003674 2016
[25]

Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, Apr. 2020, doi: 10.1109/TPAMI.2018.2889473

work page doi:10.1109/tpami.2018.2889473 2020
[26]

Delta Lake: High-performance ACID table storage over cloud object stores,

M. Armbrust et al., “Delta Lake: High-performance ACID table storage over cloud object stores,” Proc. VLDB Endow., vol. 13, no. 12, pp. 3411–3424, 2020, doi:10.14778/3415478.3415560

work page doi:10.14778/3415478.3415560 2020
[27]

Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics,

M. Armbrust et al., “Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics,” Proc. VLDB Endowment, vol. 14, no. 12, pp. 2935–2948, Aug. 2021

2021
[28]

MapReduce: Simplified Data Processing on Large Clusters,

J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proc. USENIX OSDI, San Francisco, CA, USA, 2004, pp. 137–150

2004
[29]

Apache Spark: A unified engine for big data processing,

M. Zaharia et al., “Apache Spark: A unified engine for big data processing,” Commun. ACM, vol. 59, no. 11, pp. 56–65, Nov. 2016, doi: 10.1145/2934664

work page doi:10.1145/2934664 2016
[30]

Retrieval-augmented generation for knowledge- intensive NLP tasks,

P. Lewis et al., “Retrieval-augmented generation for knowledge- intensive NLP tasks,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020, pp. 9459–9474. https://arxiv.org/abs/2005.11401

Pith/arXiv arXiv 2020
[31]

Scalable SQL and NoSQL Data Stores,

R. Cattell, “Scalable SQL and NoSQL Data Stores,” ACM SIGMOD Record, vol. 39, no. 4, pp. 12–27, 2011, doi: 10.1145/1978915.1978919

work page doi:10.1145/1978915.1978919 2011
[32]

The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out- of-order data processing,

T. Akidau et al., “The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out- of-order data processing,” Proc. VLDB Endowment, vol. 8, no. 12, pp. 1792–1803, 2015

2015
[33]

The log-structured merge-tree (LSM-tree)

P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil, “The log-structured merge-tree (LSM-tree),” Acta Informatica, vol. 33, no. 4, pp. 351–385, Jun. 1996, doi: 10.1007/s002360050048

work page doi:10.1007/s002360050048 1996
[34]

Cassandra: A decentralized structured storage system,

A. Lakshman and P. Malik, “Cassandra: A decentralized structured storage system,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, Apr. 2010, doi: 10.1145/1773912.1773922

work page doi:10.1145/1773912.1773922 2010
[35]

Billion-Scale Similarity Search with

J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, Sep. 2021, doi: 10.1109/TBDATA.2019.2921572

work page doi:10.1109/tbdata.2019.2921572 2021
[36]

SQL databases vs. NoSQL databases,

M. Stonebraker, “SQL databases vs. NoSQL databases,” Communications of the ACM, vol. 53, no. 4, pp. 10–11, Apr. 2010, doi: 10.1145/1721654.1721659

work page doi:10.1145/1721654.1721659 2010
[37]

An overview of data warehousing and OLAP technology,

S. Chaudhuri and U. Dayal, “An overview of data warehousing and OLAP technology,” SIGMOD Rec., vol. 26, no. 1, pp. 65–74, Mar. 1997, doi: 10.1145/248603.248616

work page doi:10.1145/248603.248616 1997
[38]

Gartner forecasts worldwide public cloud end-user spending to total $723 billion in 2025,

Gartner, “Gartner forecasts worldwide public cloud end-user spending to total $723 billion in 2025,” Gartner Newsroom, Nov. 19, 2024. Available: https://www.gartner.com/en/newsroom/press-releases/2024-11- 19-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-total- 723-billion-dollars-in-2025

2025
[39]

State of Data + AI,

Databricks, "State of Data + AI," Databricks, 2024. [Online]. Available: https://www.databricks.com/resources/ebook/state-of-data-ai

2024
[40]

Multi-model Databases: A New Journey to Handle the Variety of Data,

J. Lu and I. Holubová, "Multi-model Databases: A New Journey to Handle the Variety of Data," ACM Computing Surveys, vol. 52, no. 3, art. 55, pp. 1–38, July 2019, doi: 10.1145/3323214

work page doi:10.1145/3323214 2019
[41]

An introduction to spatial database systems,

R. H. Güting, “An introduction to spatial database systems,” VLDB Journal, vol. 3, no. 4, pp. 357–399, 1994, doi: 10.1007/BF01231602

work page doi:10.1007/bf01231602 1994
[42]

Bitcoin: A Peer-to-Peer Electronic Cash System,

S. Nakamoto, “Bitcoin: A Peer-to-Peer Electronic Cash System,”
[43]

https://bitcoin.org/bitcoin.pdf
[44]

Blockchain consensus protocols in the wild,

C. Cachin and M. Vukolić, “Blockchain consensus protocols in the wild,” arXiv:1707.01873, 2017

Pith/arXiv arXiv 2017
[45]

Retrieval-Augmented Generation for Large Language Models: A Survey,

Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv preprint arXiv:2312.10997, Mar. 2024

Pith/arXiv arXiv 2024
[46]

Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages =

W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, "A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models," in Proc. 30th ACM SIGKDD Conf. Knowledge Discovery and Data Mining, Barcelona, Spain, Aug. 2024, pp. 6491–6501, doi: 10.1145/3637528.3671470

work page doi:10.1145/3637528.3671470 2024
[47]

Approximate Nearest Neighbor Search on High Dimensional Data— Experiments, Analyses, and Improvement,

W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin, “Approximate Nearest Neighbor Search on High Dimensional Data— Experiments, Analyses, and Improvement,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 8, pp. 1475–1488, 2020, doi: 10.1109/TKDE.2019.2904062

work page doi:10.1109/tkde.2019.2904062 2020
[48]

Database Meets AI: A Survey,

X. Zhou, C. Chai, G. Li, and J. Sun, “Database Meets AI: A Survey,” IEEE Data Engineering Bulletin, vol. 44, no. 2, pp. 21–35, 2021

2021
[49]

SAP HANA Database: Data Management for Modern Business Applications,

F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "SAP HANA Database: Data Management for Modern Business Applications," ACM SIGMOD Record, vol. 40, no. 4, pp. 45–51, Dec. 2011, doi: 10.1145/2094114.2094126

work page doi:10.1145/2094114.2094126 2011
[50]

NoSQL database systems: a survey and decision guidance,

F. Gessert, W. Wingerath, S. Friedrich, and N. Ritter, "NoSQL database systems: a survey and decision guidance," Computer Science — Research and Development, vol. 32, no. 3–4, pp. 353–365, 2017, doi: 10.1007/s00450-016-0334-3

work page doi:10.1007/s00450-016-0334-3 2017
[51]

A Survey on NoSQL Stores,

A. Davoudian, L. Chen, and M. Liu, "A Survey on NoSQL Stores," ACM Computing Surveys, vol. 51, no. 2, art. 40, pp. 1–43, Jun. 2018, doi: 10.1145/3158661

work page doi:10.1145/3158661 2018
[52]

Selecting databases for Polyglot Persistence applications,

N. Roy-Hubara, P. Shoval, and A. Sturm, "Selecting databases for Polyglot Persistence applications," Data & Knowledge Engineering, vol. 137, art. 101950, Jan. 2022, doi: 10.1016/j.datak.2021.101950

work page doi:10.1016/j.datak.2021.101950 2022
[53]

Data Lakehouse: A survey and experimental study,

A. A. Harby and F. Zulkernine, "Data Lakehouse: A survey and experimental study," Information Systems, vol. 127, art. 102460, 2025, doi: 10.1016/j.is.2024.102460

work page doi:10.1016/j.is.2024.102460 2025

[1] [1]

One size fits all: An idea whose time has come and gone,

M. Stonebraker and U. Çetintemel, “One size fits all: An idea whose time has come and gone,” in Proc. IEEE 21st Int. Conf. Data Engineering (ICDE), Tokyo, Japan, 2005, pp. 2–11, doi: 10.1109/ICDE.2005.1

work page doi:10.1109/icde.2005.1 2005

[2] [2]

Architecture of a Database System,

J. M. Hellerstein, M. Stonebraker, and J. Hamilton, “Architecture of a Database System,” Foundations and Trends in Databases, vol. 1, no. 2, pp. 141–259, 2007, doi: 10.1561/1900000002

work page doi:10.1561/1900000002 2007

[3] [3]

Survey of Large-Scale Data Management Systems for Big Data Applications,

L. Wu, L. Yuan, and J. You, “Survey of Large-Scale Data Management Systems for Big Data Applications,” Journal of Computer Science and Technology, vol. 30, no. 1, pp. 163–183, 2015, doi: 10.1007/s11390-015- 1511-8

work page doi:10.1007/s11390-015- 2015

[4] [4]

A Survey on NoSQL Databases,

S. Gajendran, “A Survey on NoSQL Databases,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 2, pp. 192–195, 2013

2013

[5] [5]

Main-Memory Database Systems: An Overview,

F. Faerber, A. Kemper, P.-Å. Larson, J. Levandoski, T. Neumann, A. Pavlo, “Main-Memory Database Systems: An Overview,” Foundations and Trends in Databases, vol. 8, no. 1–2, pp. 1–130, 2017, doi: 10.1561/1900000058

work page doi:10.1561/1900000058 2017

[6] [6]

Time Series Management Systems: A Survey,

S. K. Jensen, T. B. Pedersen, and C. Thomsen, “Time Series Management Systems: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 11, pp. 2581–2600, 2017, doi: 10.1109/TKDE.2017.2740932

work page doi:10.1109/tkde.2017.2740932 2017

[7] [7]

Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries,

M. Besta, R. Gerstenberger, E. Peter, M. Fischer, M. Podstawski, C. Barthels, G. Alonso, T. Hoefler, “Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023, doi: 10.1145/3604932

work page doi:10.1145/3604932 2023

[8] [8]

Cloud-Native Databases: A Survey,

H. Dong, C. Zhang, G. Li, and H. Zhang, “Cloud-Native Databases: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 12, pp. 7772–7791, 2024, doi: 10.1109/TKDE.2024.3397508

work page doi:10.1109/tkde.2024.3397508 2024

[9] [9]

Data Management in the Cloud: Limitations and Opportunities,

D. J. Abadi, “Data Management in the Cloud: Limitations and Opportunities,” IEEE Data Engineering Bulletin, vol. 32, no. 1, pp. 3–12,

[10] [10]

http://sites.computer.org/debull/A09mar/abadi.pdf

[11] [11]

Data Lakes: A Survey of Functions and Systems,

R. Hai, C. Koutras, C. Quix, M. Jarke, “Data Lakes: A Survey of Functions and Systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 12, pp. 12571–12590, 2023, doi: 10.1109/TKDE.2023.3270101

work page doi:10.1109/tkde.2023.3270101 2023

[12] [12]

Highly Available Transactions: Virtues and Limitations,

P. Bailis et al., “Highly Available Transactions: Virtues and Limitations,” Proc. VLDB Endowment, vol. 7, no. 3, pp. 181–192, 2014

2014

[13] [13]

CAP Twelve Years Later: How the Rules Have Changed,

E. Brewer, “CAP Twelve Years Later: How the Rules Have Changed,” Computer, vol. 45, no. 2, pp. 23–29, 2012, doi: 10.1109/MC.2012.37

work page doi:10.1109/mc.2012.37 2012

[14] [14]

Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,

S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,” ACM SIGACT News, vol. 33, no. 2, pp. 51–59, Jun. 2002, doi: 10.1145/564585.564601

work page doi:10.1145/564585.564601 2002

[15] [15]

The End of an Architectural Era (It’s Time for a Complete Rewrite),

M. Stonebraker et al., “The End of an Architectural Era (It’s Time for a Complete Rewrite),” Proc. VLDB, Vienna, Austria, 2007, pp. 1150– 1160

2007

[16] [16]

Codd , title =

E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970, doi: 10.1145/362384.362685

work page doi:10.1145/362384.362685 1970

[17] [17]

The transaction concept: Virtues and limitations,

J. Gray, “The transaction concept: Virtues and limitations,” in Proc. 7th Int. Conf. Very Large Data Bases (VLDB), Cannes, France, 1981, pp. 144–154, doi: 10.5555/1286831.1286846

work page doi:10.5555/1286831.1286846 1981

[18] [18]

Bigtable: A distributed storage system for structured data,

F. Chang et al., “Bigtable: A distributed storage system for structured data,” in Proc. 7th USENIX Symp. Operating Systems Design and Implementation (OSDI), Seattle, WA, USA, 2006, pp. 205–218

2006

[19] [19]

Dynamo: Amazon’s highly available key-value store,

G. DeCandia et al., “Dynamo: Amazon’s highly available key-value store,” in Proc. 21st ACM Symp. Operating Systems Principles (SOSP), Stevenson, WA, USA, 2007, pp. 205–220

2007

[20] [20]

Column-oriented database systems,

D. J. Abadi, P. A. Boncz, and S. Harizopoulos, “Column-oriented database systems,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1664–1665, 2009

2009

[21] [21]

C-Store: A column-oriented DBMS,

M. Stonebraker et al., “C-Store: A column-oriented DBMS,” in Proc. 31st Int. Conf. Very Large Data Bases (VLDB), Trondheim, Norway, 2005, pp. 553–564

2005

[22] [22]

Integrating compression and execution in column-oriented database systems,

D. J. Abadi, S. Madden, and M. Ferreira, “Integrating compression and execution in column-oriented database systems,” in Proc. 2006 ACM SIGMOD Int. Conf. Management of Data, Chicago, IL, USA, 2006, pp. 671–682, doi: 10.1145/1142473.1142548

work page doi:10.1145/1142473.1142548 2006

[23] [23]

Spanner: Google’s globally distributed database,

J. C. Corbett et al., “Spanner: Google’s globally distributed database,” in Proc. 10th USENIX Symp. Operating Systems Design and Implementation (OSDI), Hollywood, CA, USA, 2012, pp. 251–264

2012

[24] [24]

What’s really new with NewSQL?,

A. Pavlo and M. Aslett, “What’s really new with NewSQL?,” ACM SIGMOD Record, vol. 45, no. 2, pp. 45–55, Jun. 2016, doi: 10.1145/3003665.3003674

work page doi:10.1145/3003665.3003674 2016

[25] [25]

Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs

Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, Apr. 2020, doi: 10.1109/TPAMI.2018.2889473

work page doi:10.1109/tpami.2018.2889473 2020

[26] [26]

Delta Lake: High-performance ACID table storage over cloud object stores,

M. Armbrust et al., “Delta Lake: High-performance ACID table storage over cloud object stores,” Proc. VLDB Endow., vol. 13, no. 12, pp. 3411–3424, 2020, doi:10.14778/3415478.3415560

work page doi:10.14778/3415478.3415560 2020

[27] [27]

Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics,

M. Armbrust et al., “Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics,” Proc. VLDB Endowment, vol. 14, no. 12, pp. 2935–2948, Aug. 2021

2021

[28] [28]

MapReduce: Simplified Data Processing on Large Clusters,

J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proc. USENIX OSDI, San Francisco, CA, USA, 2004, pp. 137–150

2004

[29] [29]

Apache Spark: A unified engine for big data processing,

M. Zaharia et al., “Apache Spark: A unified engine for big data processing,” Commun. ACM, vol. 59, no. 11, pp. 56–65, Nov. 2016, doi: 10.1145/2934664

work page doi:10.1145/2934664 2016

[30] [30]

Retrieval-augmented generation for knowledge- intensive NLP tasks,

P. Lewis et al., “Retrieval-augmented generation for knowledge- intensive NLP tasks,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020, pp. 9459–9474. https://arxiv.org/abs/2005.11401

Pith/arXiv arXiv 2020

[31] [31]

Scalable SQL and NoSQL Data Stores,

R. Cattell, “Scalable SQL and NoSQL Data Stores,” ACM SIGMOD Record, vol. 39, no. 4, pp. 12–27, 2011, doi: 10.1145/1978915.1978919

work page doi:10.1145/1978915.1978919 2011

[32] [32]

The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out- of-order data processing,

T. Akidau et al., “The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out- of-order data processing,” Proc. VLDB Endowment, vol. 8, no. 12, pp. 1792–1803, 2015

2015

[33] [33]

The log-structured merge-tree (LSM-tree)

P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil, “The log-structured merge-tree (LSM-tree),” Acta Informatica, vol. 33, no. 4, pp. 351–385, Jun. 1996, doi: 10.1007/s002360050048

work page doi:10.1007/s002360050048 1996

[34] [34]

Cassandra: A decentralized structured storage system,

A. Lakshman and P. Malik, “Cassandra: A decentralized structured storage system,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, Apr. 2010, doi: 10.1145/1773912.1773922

work page doi:10.1145/1773912.1773922 2010

[35] [35]

Billion-Scale Similarity Search with

J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, Sep. 2021, doi: 10.1109/TBDATA.2019.2921572

work page doi:10.1109/tbdata.2019.2921572 2021

[36] [36]

SQL databases vs. NoSQL databases,

M. Stonebraker, “SQL databases vs. NoSQL databases,” Communications of the ACM, vol. 53, no. 4, pp. 10–11, Apr. 2010, doi: 10.1145/1721654.1721659

work page doi:10.1145/1721654.1721659 2010

[37] [37]

An overview of data warehousing and OLAP technology,

S. Chaudhuri and U. Dayal, “An overview of data warehousing and OLAP technology,” SIGMOD Rec., vol. 26, no. 1, pp. 65–74, Mar. 1997, doi: 10.1145/248603.248616

work page doi:10.1145/248603.248616 1997

[38] [38]

Gartner forecasts worldwide public cloud end-user spending to total $723 billion in 2025,

Gartner, “Gartner forecasts worldwide public cloud end-user spending to total $723 billion in 2025,” Gartner Newsroom, Nov. 19, 2024. Available: https://www.gartner.com/en/newsroom/press-releases/2024-11- 19-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-total- 723-billion-dollars-in-2025

2025

[39] [39]

State of Data + AI,

Databricks, "State of Data + AI," Databricks, 2024. [Online]. Available: https://www.databricks.com/resources/ebook/state-of-data-ai

2024

[40] [40]

Multi-model Databases: A New Journey to Handle the Variety of Data,

J. Lu and I. Holubová, "Multi-model Databases: A New Journey to Handle the Variety of Data," ACM Computing Surveys, vol. 52, no. 3, art. 55, pp. 1–38, July 2019, doi: 10.1145/3323214

work page doi:10.1145/3323214 2019

[41] [41]

An introduction to spatial database systems,

R. H. Güting, “An introduction to spatial database systems,” VLDB Journal, vol. 3, no. 4, pp. 357–399, 1994, doi: 10.1007/BF01231602

work page doi:10.1007/bf01231602 1994

[42] [42]

Bitcoin: A Peer-to-Peer Electronic Cash System,

S. Nakamoto, “Bitcoin: A Peer-to-Peer Electronic Cash System,”

[43] [43]

https://bitcoin.org/bitcoin.pdf

[44] [44]

Blockchain consensus protocols in the wild,

C. Cachin and M. Vukolić, “Blockchain consensus protocols in the wild,” arXiv:1707.01873, 2017

Pith/arXiv arXiv 2017

[45] [45]

Retrieval-Augmented Generation for Large Language Models: A Survey,

Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv preprint arXiv:2312.10997, Mar. 2024

Pith/arXiv arXiv 2024

[46] [46]

Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages =

W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, "A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models," in Proc. 30th ACM SIGKDD Conf. Knowledge Discovery and Data Mining, Barcelona, Spain, Aug. 2024, pp. 6491–6501, doi: 10.1145/3637528.3671470

work page doi:10.1145/3637528.3671470 2024

[47] [47]

Approximate Nearest Neighbor Search on High Dimensional Data— Experiments, Analyses, and Improvement,

W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin, “Approximate Nearest Neighbor Search on High Dimensional Data— Experiments, Analyses, and Improvement,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 8, pp. 1475–1488, 2020, doi: 10.1109/TKDE.2019.2904062

work page doi:10.1109/tkde.2019.2904062 2020

[48] [48]

Database Meets AI: A Survey,

X. Zhou, C. Chai, G. Li, and J. Sun, “Database Meets AI: A Survey,” IEEE Data Engineering Bulletin, vol. 44, no. 2, pp. 21–35, 2021

2021

[49] [49]

SAP HANA Database: Data Management for Modern Business Applications,

F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "SAP HANA Database: Data Management for Modern Business Applications," ACM SIGMOD Record, vol. 40, no. 4, pp. 45–51, Dec. 2011, doi: 10.1145/2094114.2094126

work page doi:10.1145/2094114.2094126 2011

[50] [50]

NoSQL database systems: a survey and decision guidance,

F. Gessert, W. Wingerath, S. Friedrich, and N. Ritter, "NoSQL database systems: a survey and decision guidance," Computer Science — Research and Development, vol. 32, no. 3–4, pp. 353–365, 2017, doi: 10.1007/s00450-016-0334-3

work page doi:10.1007/s00450-016-0334-3 2017

[51] [51]

A Survey on NoSQL Stores,

A. Davoudian, L. Chen, and M. Liu, "A Survey on NoSQL Stores," ACM Computing Surveys, vol. 51, no. 2, art. 40, pp. 1–43, Jun. 2018, doi: 10.1145/3158661

work page doi:10.1145/3158661 2018

[52] [52]

Selecting databases for Polyglot Persistence applications,

N. Roy-Hubara, P. Shoval, and A. Sturm, "Selecting databases for Polyglot Persistence applications," Data & Knowledge Engineering, vol. 137, art. 101950, Jan. 2022, doi: 10.1016/j.datak.2021.101950

work page doi:10.1016/j.datak.2021.101950 2022

[53] [53]

Data Lakehouse: A survey and experimental study,

A. A. Harby and F. Zulkernine, "Data Lakehouse: A survey and experimental study," Information Systems, vol. 127, art. 102460, 2025, doi: 10.1016/j.is.2024.102460

work page doi:10.1016/j.is.2024.102460 2025