Architectural Evolution and Selection Framework for Database Systems in AI-Ready Data Platforms
Pith reviewed 2026-06-27 18:39 UTC · model grok-4.3
The pith
A framework of nine dimensions and multi-stage selection identifies hybrid polyglot architectures as optimal for AI-ready data platforms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that its cross-paradigm framework, grounded in nine dimensions and a multi-stage process of workload characterization, constraint filtering, and compatibility scoring, enables systematic evaluation of database architectures. When used to compare thirteen paradigms, it uncovers patterns of decoupling storage and compute, workload-driven specialization, and convergence on AI-ready platforms. The framework's application to an enterprise fraud detection scenario shows hybrid polyglot architectures as the preferred solution for multidimensional workloads, leading to a proposed reference architecture integrating lakehouse, feature, and semantic layers.
What carries the argument
Nine architectural dimensions combined with the multi-stage selection process of workload characterization, constraint filtering, and compatibility scoring.
If this is right
- Database selection shifts from ad-hoc intuition to a repeatable, cross-paradigm comparison.
- Hybrid polyglot architectures emerge as the solution for workloads with mixed transactional, analytical, and AI requirements.
- Three evolutionary patterns guide future platform design: storage-compute decoupling, specialization, and integration of AI capabilities.
- A reference architecture using lakehouse storage plus feature and semantic layers supports modern analytics and generation tasks.
Where Pith is reading between the lines
- The same nine dimensions could be tested on workloads outside finance, such as healthcare or e-commerce, to check broader applicability.
- If new database paradigms appear that resist classification along the existing dimensions, the framework would require explicit extension.
- Organizations could embed the multi-stage process into procurement tools to quantify trade-offs before migration decisions.
- The observed convergence toward AI-ready platforms implies that future single-paradigm systems will need native support for feature stores and vector search to remain competitive.
Load-bearing premise
The nine architectural dimensions and the multi-stage selection process are both comprehensive enough and sufficient to capture all decision-relevant factors across transactional, analytical, and AI-oriented database systems.
What would settle it
A deployment of the framework on a multidimensional workload in which the recommended hybrid polyglot architecture underperforms a single-paradigm system on cost, latency, or accuracy metrics.
Figures
read the original abstract
The rise of polyglot data management and AI-ready database architectures has created a complex design space across diverse database paradigms. However, architecture selection in modern enterprise environments continues to rely heavily on ad-hoc engineering intuition, with limited systematic frameworks to guide decision-making across heterogeneous database systems. This paper introduces a unified cross-paradigm evaluation and selection framework for database architecture design in AI-ready data platforms. The framework is based on nine architectural dimensions and incorporates a structured multi-stage selection process involving workload characterization, constraint filtering, and compatibility scoring to enable systematic comparison and decision-making. To ground the framework, we conduct a structured comparative analysis across thirteen major database paradigms spanning transactional, analytical, and AI-oriented systems. This analysis reveals three recurring patterns in database evolution: decoupling of storage and compute, workload-driven specialization, and convergence toward integrated AI-ready platforms. The proposed framework is demonstrated through a representative enterprise case study in financial fraud detection, illustrating how hybrid, polyglot architectures emerge as optimal solutions for multidimensional workload requirements. The cross-paradigm analysis culminates in an AI-ready reference architecture that integrates lakehouse storage, feature processing, and semantic retrieval layers as the unified substrate for modern analytics, machine learning, and Retrieval-Augmented Generation applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a unified cross-paradigm evaluation and selection framework for database architecture design in AI-ready data platforms. The framework is based on nine architectural dimensions and a multi-stage selection process (workload characterization, constraint filtering, compatibility scoring). It performs a comparative analysis across thirteen database paradigms spanning transactional, analytical, and AI-oriented systems, identifies three recurring patterns in database evolution (decoupling of storage and compute, workload-driven specialization, and convergence toward integrated AI-ready platforms), and demonstrates the framework through a financial fraud detection case study, concluding that hybrid polyglot architectures emerge as optimal and proposing an AI-ready reference architecture integrating lakehouse storage, feature processing, and semantic retrieval layers.
Significance. If the nine dimensions and multi-stage process can be shown to be comprehensive and reproducible, the work would supply a valuable systematic alternative to ad-hoc intuition for selecting database architectures in polyglot, AI-integrated environments. The cross-paradigm analysis and reference architecture could usefully inform design decisions for multidimensional workloads. The absence of quantitative validation, benchmarking, or sensitivity analysis, however, limits the strength of these contributions.
major comments (3)
- [Abstract] Abstract: the central claim that the framework enables systematic comparison and reveals hybrid polyglot architectures as optimal rests on application to thirteen paradigms and one case study, yet the abstract (and by extension the manuscript) supplies no quantitative scores, exclusion criteria, error analysis, or comparison against existing selection methods, leaving the optimality conclusion unsupported by visible evidence.
- [Framework description] Framework section (likely §3): the nine architectural dimensions are asserted to be both necessary and sufficient for cross-paradigm comparison, but the manuscript provides no derivation of how the dimensions were selected, no ablation (e.g., effect of removing one dimension on the ranking), and no inter-expert reproducibility test of the compatibility scoring step; this directly undermines the reliability of the multi-stage process.
- [Case study] Case study section: the fraud-detection example concludes that hybrid architectures are optimal, but without sensitivity analysis on the nine dimensions or external benchmarking, the result remains dependent on the untested premise that the chosen dimensions capture every decision-relevant factor.
minor comments (1)
- [Abstract] Abstract: the three reported patterns are presented as independently derived, but it is unclear whether they are restatements of the input dimensions and case-study outcomes rather than externally validated observations.
Simulated Author's Rebuttal
We thank the referee for the insightful comments, which help clarify the scope and limitations of our proposed framework. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] the central claim that the framework enables systematic comparison and reveals hybrid polyglot architectures as optimal rests on application to thirteen paradigms and one case study, yet the abstract supplies no quantitative scores, exclusion criteria, error analysis, or comparison against existing selection methods, leaving the optimality conclusion unsupported by visible evidence.
Authors: We agree that the abstract could be improved to avoid overstating the evidence. The manuscript's contribution is the framework itself and its application, which is qualitative. We will revise the abstract to state that the framework supports systematic comparison and that the case study shows hybrid architectures emerging as suitable, rather than claiming they are optimal in a general sense. revision: yes
-
Referee: [Framework description] the nine architectural dimensions are asserted to be both necessary and sufficient for cross-paradigm comparison, but the manuscript provides no derivation of how the dimensions were selected, no ablation (e.g., effect of removing one dimension on the ranking), and no inter-expert reproducibility test of the compatibility scoring step; this directly undermines the reliability of the multi-stage process.
Authors: The nine dimensions were identified based on a synthesis of database system literature covering transactional, analytical, and AI-oriented paradigms. We will expand the framework section to provide the derivation and sources for the dimensions. We did not include ablation or reproducibility tests, as the work is a proposal for a selection framework rather than an empirical validation study. A new limitations subsection will be added to discuss this. revision: partial
-
Referee: [Case study] the fraud-detection example concludes that hybrid architectures are optimal, but without sensitivity analysis on the nine dimensions or external benchmarking, the result remains dependent on the untested premise that the chosen dimensions capture every decision-relevant factor.
Authors: We acknowledge this limitation in the case study. We will revise the case study to frame the conclusion as an example of how the framework can be applied to identify hybrid solutions for the given workload, rather than a general optimality claim. We will also add a note on the dependence on the chosen dimensions and suggest sensitivity analysis as future work. revision: partial
- The absence of quantitative validation, benchmarking, sensitivity analysis, ablation studies, and inter-expert reproducibility tests, as these would require additional empirical research beyond the current manuscript.
Circularity Check
No significant circularity; framework presented as proposal without self-referential reduction
full rationale
The paper introduces nine architectural dimensions and a multi-stage selection process as the basis for a new framework, then applies them to a comparative analysis of thirteen paradigms and one case study to surface three patterns and an optimal hybrid architecture. No equations, fitted parameters, or self-citations are shown reducing the claimed patterns or optimality conclusion back to the input dimensions by construction. The dimensions are posited rather than derived from the outcomes they evaluate, and the analysis is presented as an application rather than a closed loop. This is a standard proposal of a decision framework with no load-bearing self-definition or renaming of known results as new derivations.
Axiom & Free-Parameter Ledger
axioms (2)
- ad hoc to paper Nine architectural dimensions are both necessary and sufficient for cross-paradigm comparison.
- ad hoc to paper The multi-stage process (characterization, filtering, scoring) produces reliable optimality rankings.
Reference graph
Works this paper leans on
-
[1]
One size fits all: An idea whose time has come and gone,
M. Stonebraker and U. Çetintemel, “One size fits all: An idea whose time has come and gone,” in Proc. IEEE 21st Int. Conf. Data Engineering (ICDE), Tokyo, Japan, 2005, pp. 2–11, doi: 10.1109/ICDE.2005.1
-
[2]
Architecture of a Database System,
J. M. Hellerstein, M. Stonebraker, and J. Hamilton, “Architecture of a Database System,” Foundations and Trends in Databases, vol. 1, no. 2, pp. 141–259, 2007, doi: 10.1561/1900000002
-
[3]
Survey of Large-Scale Data Management Systems for Big Data Applications,
L. Wu, L. Yuan, and J. You, “Survey of Large-Scale Data Management Systems for Big Data Applications,” Journal of Computer Science and Technology, vol. 30, no. 1, pp. 163–183, 2015, doi: 10.1007/s11390-015- 1511-8
-
[4]
A Survey on NoSQL Databases,
S. Gajendran, “A Survey on NoSQL Databases,” International Journal of Advanced Research in Computer Science and Software Engineering, vol. 3, no. 2, pp. 192–195, 2013
2013
-
[5]
Main-Memory Database Systems: An Overview,
F. Faerber, A. Kemper, P.-Å. Larson, J. Levandoski, T. Neumann, A. Pavlo, “Main-Memory Database Systems: An Overview,” Foundations and Trends in Databases, vol. 8, no. 1–2, pp. 1–130, 2017, doi: 10.1561/1900000058
-
[6]
Time Series Management Systems: A Survey,
S. K. Jensen, T. B. Pedersen, and C. Thomsen, “Time Series Management Systems: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 11, pp. 2581–2600, 2017, doi: 10.1109/TKDE.2017.2740932
-
[7]
M. Besta, R. Gerstenberger, E. Peter, M. Fischer, M. Podstawski, C. Barthels, G. Alonso, T. Hoefler, “Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries,” ACM Computing Surveys, vol. 56, no. 2, pp. 1–40, 2023, doi: 10.1145/3604932
-
[8]
Cloud-Native Databases: A Survey,
H. Dong, C. Zhang, G. Li, and H. Zhang, “Cloud-Native Databases: A Survey,” IEEE Transactions on Knowledge and Data Engineering, vol. 36, no. 12, pp. 7772–7791, 2024, doi: 10.1109/TKDE.2024.3397508
-
[9]
Data Management in the Cloud: Limitations and Opportunities,
D. J. Abadi, “Data Management in the Cloud: Limitations and Opportunities,” IEEE Data Engineering Bulletin, vol. 32, no. 1, pp. 3–12,
-
[10]
http://sites.computer.org/debull/A09mar/abadi.pdf
-
[11]
Data Lakes: A Survey of Functions and Systems,
R. Hai, C. Koutras, C. Quix, M. Jarke, “Data Lakes: A Survey of Functions and Systems,” IEEE Transactions on Knowledge and Data Engineering, vol. 35, no. 12, pp. 12571–12590, 2023, doi: 10.1109/TKDE.2023.3270101
-
[12]
Highly Available Transactions: Virtues and Limitations,
P. Bailis et al., “Highly Available Transactions: Virtues and Limitations,” Proc. VLDB Endowment, vol. 7, no. 3, pp. 181–192, 2014
2014
-
[13]
CAP Twelve Years Later: How the Rules Have Changed,
E. Brewer, “CAP Twelve Years Later: How the Rules Have Changed,” Computer, vol. 45, no. 2, pp. 23–29, 2012, doi: 10.1109/MC.2012.37
-
[14]
Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,
S. Gilbert and N. Lynch, “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services,” ACM SIGACT News, vol. 33, no. 2, pp. 51–59, Jun. 2002, doi: 10.1145/564585.564601
-
[15]
The End of an Architectural Era (It’s Time for a Complete Rewrite),
M. Stonebraker et al., “The End of an Architectural Era (It’s Time for a Complete Rewrite),” Proc. VLDB, Vienna, Austria, 2007, pp. 1150– 1160
2007
-
[16]
E. F. Codd, “A Relational Model of Data for Large Shared Data Banks,” Communications of the ACM, vol. 13, no. 6, pp. 377–387, 1970, doi: 10.1145/362384.362685
-
[17]
The transaction concept: Virtues and limitations,
J. Gray, “The transaction concept: Virtues and limitations,” in Proc. 7th Int. Conf. Very Large Data Bases (VLDB), Cannes, France, 1981, pp. 144–154, doi: 10.5555/1286831.1286846
-
[18]
Bigtable: A distributed storage system for structured data,
F. Chang et al., “Bigtable: A distributed storage system for structured data,” in Proc. 7th USENIX Symp. Operating Systems Design and Implementation (OSDI), Seattle, WA, USA, 2006, pp. 205–218
2006
-
[19]
Dynamo: Amazon’s highly available key-value store,
G. DeCandia et al., “Dynamo: Amazon’s highly available key-value store,” in Proc. 21st ACM Symp. Operating Systems Principles (SOSP), Stevenson, WA, USA, 2007, pp. 205–220
2007
-
[20]
Column-oriented database systems,
D. J. Abadi, P. A. Boncz, and S. Harizopoulos, “Column-oriented database systems,” Proc. VLDB Endow., vol. 2, no. 2, pp. 1664–1665, 2009
2009
-
[21]
C-Store: A column-oriented DBMS,
M. Stonebraker et al., “C-Store: A column-oriented DBMS,” in Proc. 31st Int. Conf. Very Large Data Bases (VLDB), Trondheim, Norway, 2005, pp. 553–564
2005
-
[22]
Integrating compression and execution in column-oriented database systems,
D. J. Abadi, S. Madden, and M. Ferreira, “Integrating compression and execution in column-oriented database systems,” in Proc. 2006 ACM SIGMOD Int. Conf. Management of Data, Chicago, IL, USA, 2006, pp. 671–682, doi: 10.1145/1142473.1142548
-
[23]
Spanner: Google’s globally distributed database,
J. C. Corbett et al., “Spanner: Google’s globally distributed database,” in Proc. 10th USENIX Symp. Operating Systems Design and Implementation (OSDI), Hollywood, CA, USA, 2012, pp. 251–264
2012
-
[24]
What’s really new with NewSQL?,
A. Pavlo and M. Aslett, “What’s really new with NewSQL?,” ACM SIGMOD Record, vol. 45, no. 2, pp. 45–55, Jun. 2016, doi: 10.1145/3003665.3003674
-
[25]
Y. A. Malkov and D. A. Yashunin, “Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 4, pp. 824–836, Apr. 2020, doi: 10.1109/TPAMI.2018.2889473
-
[26]
Delta Lake: High-performance ACID table storage over cloud object stores,
M. Armbrust et al., “Delta Lake: High-performance ACID table storage over cloud object stores,” Proc. VLDB Endow., vol. 13, no. 12, pp. 3411–3424, 2020, doi:10.14778/3415478.3415560
-
[27]
Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics,
M. Armbrust et al., “Lakehouse: A new generation of open platforms that unify data warehousing and advanced analytics,” Proc. VLDB Endowment, vol. 14, no. 12, pp. 2935–2948, Aug. 2021
2021
-
[28]
MapReduce: Simplified Data Processing on Large Clusters,
J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proc. USENIX OSDI, San Francisco, CA, USA, 2004, pp. 137–150
2004
-
[29]
Apache Spark: A unified engine for big data processing,
M. Zaharia et al., “Apache Spark: A unified engine for big data processing,” Commun. ACM, vol. 59, no. 11, pp. 56–65, Nov. 2016, doi: 10.1145/2934664
-
[30]
Retrieval-augmented generation for knowledge- intensive NLP tasks,
P. Lewis et al., “Retrieval-augmented generation for knowledge- intensive NLP tasks,” in Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020, pp. 9459–9474. https://arxiv.org/abs/2005.11401
Pith/arXiv arXiv 2020
-
[31]
Scalable SQL and NoSQL Data Stores,
R. Cattell, “Scalable SQL and NoSQL Data Stores,” ACM SIGMOD Record, vol. 39, no. 4, pp. 12–27, 2011, doi: 10.1145/1978915.1978919
-
[32]
The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out- of-order data processing,
T. Akidau et al., “The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out- of-order data processing,” Proc. VLDB Endowment, vol. 8, no. 12, pp. 1792–1803, 2015
2015
-
[33]
The log-structured merge-tree (LSM-tree)
P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil, “The log-structured merge-tree (LSM-tree),” Acta Informatica, vol. 33, no. 4, pp. 351–385, Jun. 1996, doi: 10.1007/s002360050048
-
[34]
Cassandra: A decentralized structured storage system,
A. Lakshman and P. Malik, “Cassandra: A decentralized structured storage system,” ACM SIGOPS Operating Systems Review, vol. 44, no. 2, pp. 35–40, Apr. 2010, doi: 10.1145/1773912.1773922
-
[35]
Billion-Scale Similarity Search with
J. Johnson, M. Douze, and H. Jégou, “Billion-scale similarity search with GPUs,” IEEE Trans. Big Data, vol. 7, no. 3, pp. 535–547, Sep. 2021, doi: 10.1109/TBDATA.2019.2921572
-
[36]
SQL databases vs. NoSQL databases,
M. Stonebraker, “SQL databases vs. NoSQL databases,” Communications of the ACM, vol. 53, no. 4, pp. 10–11, Apr. 2010, doi: 10.1145/1721654.1721659
-
[37]
An overview of data warehousing and OLAP technology,
S. Chaudhuri and U. Dayal, “An overview of data warehousing and OLAP technology,” SIGMOD Rec., vol. 26, no. 1, pp. 65–74, Mar. 1997, doi: 10.1145/248603.248616
-
[38]
Gartner forecasts worldwide public cloud end-user spending to total $723 billion in 2025,
Gartner, “Gartner forecasts worldwide public cloud end-user spending to total $723 billion in 2025,” Gartner Newsroom, Nov. 19, 2024. Available: https://www.gartner.com/en/newsroom/press-releases/2024-11- 19-gartner-forecasts-worldwide-public-cloud-end-user-spending-to-total- 723-billion-dollars-in-2025
2025
-
[39]
State of Data + AI,
Databricks, "State of Data + AI," Databricks, 2024. [Online]. Available: https://www.databricks.com/resources/ebook/state-of-data-ai
2024
-
[40]
Multi-model Databases: A New Journey to Handle the Variety of Data,
J. Lu and I. Holubová, "Multi-model Databases: A New Journey to Handle the Variety of Data," ACM Computing Surveys, vol. 52, no. 3, art. 55, pp. 1–38, July 2019, doi: 10.1145/3323214
-
[41]
An introduction to spatial database systems,
R. H. Güting, “An introduction to spatial database systems,” VLDB Journal, vol. 3, no. 4, pp. 357–399, 1994, doi: 10.1007/BF01231602
-
[42]
Bitcoin: A Peer-to-Peer Electronic Cash System,
S. Nakamoto, “Bitcoin: A Peer-to-Peer Electronic Cash System,”
-
[43]
https://bitcoin.org/bitcoin.pdf
-
[44]
Blockchain consensus protocols in the wild,
C. Cachin and M. Vukolić, “Blockchain consensus protocols in the wild,” arXiv:1707.01873, 2017
Pith/arXiv arXiv 2017
-
[45]
Retrieval-Augmented Generation for Large Language Models: A Survey,
Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, "Retrieval-Augmented Generation for Large Language Models: A Survey," arXiv preprint arXiv:2312.10997, Mar. 2024
Pith/arXiv arXiv 2024
-
[46]
Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining , pages =
W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, "A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models," in Proc. 30th ACM SIGKDD Conf. Knowledge Discovery and Data Mining, Barcelona, Spain, Aug. 2024, pp. 6491–6501, doi: 10.1145/3637528.3671470
-
[47]
W. Li, Y. Zhang, Y. Sun, W. Wang, M. Li, W. Zhang, and X. Lin, “Approximate Nearest Neighbor Search on High Dimensional Data— Experiments, Analyses, and Improvement,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 8, pp. 1475–1488, 2020, doi: 10.1109/TKDE.2019.2904062
-
[48]
Database Meets AI: A Survey,
X. Zhou, C. Chai, G. Li, and J. Sun, “Database Meets AI: A Survey,” IEEE Data Engineering Bulletin, vol. 44, no. 2, pp. 21–35, 2021
2021
-
[49]
SAP HANA Database: Data Management for Modern Business Applications,
F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner, "SAP HANA Database: Data Management for Modern Business Applications," ACM SIGMOD Record, vol. 40, no. 4, pp. 45–51, Dec. 2011, doi: 10.1145/2094114.2094126
-
[50]
NoSQL database systems: a survey and decision guidance,
F. Gessert, W. Wingerath, S. Friedrich, and N. Ritter, "NoSQL database systems: a survey and decision guidance," Computer Science — Research and Development, vol. 32, no. 3–4, pp. 353–365, 2017, doi: 10.1007/s00450-016-0334-3
-
[51]
A. Davoudian, L. Chen, and M. Liu, "A Survey on NoSQL Stores," ACM Computing Surveys, vol. 51, no. 2, art. 40, pp. 1–43, Jun. 2018, doi: 10.1145/3158661
-
[52]
Selecting databases for Polyglot Persistence applications,
N. Roy-Hubara, P. Shoval, and A. Sturm, "Selecting databases for Polyglot Persistence applications," Data & Knowledge Engineering, vol. 137, art. 101950, Jan. 2022, doi: 10.1016/j.datak.2021.101950
-
[53]
Data Lakehouse: A survey and experimental study,
A. A. Harby and F. Zulkernine, "Data Lakehouse: A survey and experimental study," Information Systems, vol. 127, art. 102460, 2025, doi: 10.1016/j.is.2024.102460
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.