pith. sign in

arxiv: 2605.31097 · v1 · pith:K3XZ7ZCBnew · submitted 2026-05-29 · 💻 cs.DB · cs.AI

SpecDB: LLM-Generated Customized Databases via Feature-Oriented Decomposition

Pith reviewed 2026-06-28 20:33 UTC · model grok-4.3

classification 💻 cs.DB cs.AI
keywords LLM-generated databasesfeature-oriented decompositioncustomized relational databasesTPC-C benchmarkdatabase synthesisFODA feature modelagent-based code generationmodular database design
0
0 comments X

The pith

LLMs can generate a workload-matched database that matches PostgreSQL throughput on TPC-C while using roughly three percent of the code.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Mainstream databases ship many subsystems that most workloads never touch. The paper asks whether a database can instead be built on demand so that its feature set exactly fits the target workload. It decomposes nine production systems into ten modules and their implementation variants, then uses an extended feature model to record dependencies that cross module boundaries. A pipeline of LLM subagents assembles the modules and a separate refining agent repairs integration problems against a supplied test harness. The resulting 23,779-line Rust program completes a sixty-minute TPC-C run at one and ten warehouses with zero errors and delivers throughput within a few percent of PostgreSQL and MySQL.

Core claim

By surveying nine production databases, decomposing them into ten functional modules with multiple implementation variants, extending the FODA feature model with cooperate edges to capture cross-module constraints, and driving the resulting DBGraph with a layered agent pipeline plus an iterative refining agent, it is possible to synthesize from a natural-language workload description a complete, correct, and deployable relational database whose performance on TPC-C is comparable to established engines.

What carries the argument

DBGraph, the dependency graph formed by extending the FODA feature model with cooperate edges that link implementation choices required to work together across modules.

If this is right

  • A natural-language description of a workload can be translated directly into a ready-to-run database binary.
  • Implementation techniques previously scattered across separate source trees can be combined inside one generated product.
  • The final artifact can be orders of magnitude smaller than conventional database engines while retaining equivalent performance on the chosen workload.
  • Iterative repair against a read-only harness can remove defects introduced during module assembly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Workloads too narrow to justify a full commercial database might still receive a purpose-built implementation at low cost.
  • The same decomposition and agent pipeline could be applied to other large, feature-rich software systems outside the database domain.
  • Repeated regeneration of the database as workload characteristics change would become practical if LLM inference costs continue to fall.

Load-bearing premise

The LLM subagents and refining agent can produce and integrate module code that satisfies every cross-module dependency recorded in the model and passes the user-supplied validation harness without hidden defects.

What would settle it

A second independent workload description fed to the same pipeline produces a database that either fails to complete its benchmark or returns incorrect results on any transaction.

Figures

Figures reproduced from arXiv: 2605.31097 by Longbin Lai, Shunyang Li, Ying Zhang, Yunkai Lou, Zhengping Qian.

Figure 1
Figure 1. Figure 1: Refining progression: per-transaction mean latency (top) and tpmC throughput (bottom) across 12 repre [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Result. After 8 merge passes (one per additional database), the unified module tree contains 10 top-level mod￾ules (architecture, concurrency, core, indexing, parser, query_processing, recovery, replication, storage, value_types) and several hundred leaves [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 2
Figure 2. Figure 2: Two representative branches of the SPECDB module tree—concurrency and storage—illustrating the notation used throughout. Node shapes distinguish modules (rectangles, internal subtrees) from files (ovals, leaves). On parent–child edges, a filled circle (•) marks the child as mandatory and an empty circle (◦) marks it optional; a filled arc spanning a sibling group marks an alternative (XOR, exactly one sele… view at source ↗
Figure 3
Figure 3. Figure 3: End-to-end SPECDB workflow. Skills and agents (shaded) operate over DBMODULEGRAPH: specdb-select maps a workload to a node selection, specdb-decycle merges cooperate cycles into contract nodes, specdb-construct drives the layered module construction, and the specdb-refining closes the loop against a user harness. The dashed arrow indicates that the specdb-refining may re-invoke specdb-construct when a miss… view at source ↗
Figure 4
Figure 4. Figure 4: specdb-decycle: a 2-leaf cooperate component binding the inplace MVCC variant (xmin/xmax tuple header) and the Repeatable Read isolation level (left) collapses into the single contract node contract/inplace_rr_coord (right). The contract carries the visibility predicate (a row is visible iff its xmin committed before the transaction’s snapshot and its xmax did not) that neither member specifies alone. Requ… view at source ↗
read the original abstract

Mainstream relational databases ship a uniform feature set across deployments, although individual workloads exercise only a fraction of the available subsystems. We investigate whether a database can instead be generated on demand with a feature set matched to the target workload. We present SpecDB, a system that uses large language models (LLMs) to synthesize customized relational databases. We survey 9 production systems and decompose them into 10 functional modules, each further divided into implementation variants. To capture cross-module dependencies, including cases where implementations in disjoint subtrees must be co-designed, we adopt the FODA feature model and extend it with a cooperate edge, yielding a dependency graph DBGraph. SpecDB operationalizes DBGraph through a layered module-construction pipeline in which each module is generated, validated, and integrated by a dedicated subagent (driven by three inner agents: Main, Tester, Architect), and a Refining Agent that iteratively repairs and tunes the assembled database against a user-supplied refining harness with read-only access to existing database source code. A companion selection component translates a natural-language workload description into a set of implementation variants, providing an end-to-end pipeline from workload description to deployable database. We evaluate SpecDB on TPC-C with BenchmarkSQL. The generated database (23,779 lines of Rust) completes 60-minute TPC-C at 1 and 10 warehouses with zero errors. At 10 warehouses it reaches tpmC=130, compared to 128 for PostgreSQL and 127 for MySQL, with comparable latency at ~3% of their code size. Because the agent operates at module-specification level rather than product source, it can in principle combine techniques across system boundaries. Paired with falling LLM costs, generating a purpose-built database for a target workload is becoming straightforward.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that SpecDB uses LLMs to synthesize customized relational databases from natural-language workload descriptions. It decomposes 9 production systems into 10 modules with implementation variants, extends the FODA model with cooperate edges to form DBGraph for capturing cross-module dependencies, and employs a three-layer agent pipeline (Main/Tester/Architect per module plus a Refining Agent with read-only source access) to generate, validate, and iteratively repair a Rust implementation. On TPC-C with BenchmarkSQL, the resulting 23,779-line database completes 60-minute runs at 1 and 10 warehouses with zero errors, achieving tpmC=130 at 10 warehouses (vs. 128 for PostgreSQL and 127 for MySQL) with comparable latency at ~3% of their code size.

Significance. If the central empirical claim holds, the work would be significant for demonstrating an end-to-end pipeline from workload description to a deployable, high-performance database via LLM agents operating at the module-specification level. The DBGraph extension for co-design dependencies and the agent-based synthesis provide a concrete operationalization of feature-oriented database generation. The TPC-C result supplies a benchmark data point showing competitive throughput with dramatically reduced code size, which, paired with the reproducible construction approach, could stimulate further research on automated, workload-specific systems.

major comments (3)
  1. [Abstract] Abstract: The concrete TPC-C performance numbers (zero-error 60-min runs, tpmC=130 at 10 warehouses) are presented without any information on the number of experimental runs, statistical significance, variance, prompt engineering details, or failure modes of the subagents, leaving the central performance claim with only weak evidential grounding.
  2. [Refining Agent description] Section describing the Refining Agent: The manuscript states that the Refining Agent has only read-only access to existing source code and is driven by the same LLM family; this does not establish that subtle concurrency, transaction, or index-interaction defects violating DBGraph cooperate edges will be detected if they evade the BenchmarkSQL workload used for refinement.
  3. [Evaluation] Evaluation section: No details are supplied on how the refining harness was constructed or on any validation that cross-module dependencies captured by the extended FODA model are satisfied in the final generated code, which is load-bearing for the zero-error and tpmC claims.
minor comments (1)
  1. [Abstract] The abstract would benefit from explicitly stating the number of modules and variants surveyed from the 9 production systems to clarify the scope of the decomposition.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important gaps in experimental reporting and validation details that we will address in revision. Below we respond point-by-point to the three major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The concrete TPC-C performance numbers (zero-error 60-min runs, tpmC=130 at 10 warehouses) are presented without any information on the number of experimental runs, statistical significance, variance, prompt engineering details, or failure modes of the subagents, leaving the central performance claim with only weak evidential grounding.

    Authors: We agree the current presentation provides insufficient experimental context. In the revised manuscript we will expand both the abstract and Evaluation section to report: (1) that a single end-to-end generation and single 60-minute TPC-C run per warehouse count was performed (due to the high cost of LLM-based synthesis and validation), (2) the specific prompt-engineering patterns used for the subagents, and (3) the main failure modes observed during iterative refinement (e.g., transaction isolation violations caught by the Tester agent). We will also note the absence of statistical significance testing as a limitation of the current evaluation. revision: yes

  2. Referee: [Refining Agent description] Section describing the Refining Agent: The manuscript states that the Refining Agent has only read-only access to existing source code and is driven by the same LLM family; this does not establish that subtle concurrency, transaction, or index-interaction defects violating DBGraph cooperate edges will be detected if they evade the BenchmarkSQL workload used for refinement.

    Authors: The observation is correct. The Refining Agent operates under read-only constraints and relies on the BenchmarkSQL workload to surface defects; it therefore cannot guarantee detection of all possible violations of DBGraph cooperate edges. We will revise the Refining Agent section to explicitly acknowledge this scope limitation and to clarify that the reported zero-error result holds only for the exercised BenchmarkSQL workload. We maintain that this is still a meaningful practical validation for a workload-specific database, but we will not claim exhaustive coverage of all cross-module interactions. revision: yes

  3. Referee: [Evaluation] Evaluation section: No details are supplied on how the refining harness was constructed or on any validation that cross-module dependencies captured by the extended FODA model are satisfied in the final generated code, which is load-bearing for the zero-error and tpmC claims.

    Authors: We accept that the Evaluation section lacks necessary detail on harness construction and dependency validation. In revision we will add: (1) a description of how the BenchmarkSQL refining harness was configured for the generated Rust database (including any required schema adaptations and connection parameters), and (2) the concrete checks performed by the Refining Agent and Architect agents to verify that DBGraph cooperate edges are respected in the assembled code. These additions will directly support the zero-error and performance claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system construction with direct benchmark measurement

full rationale

The paper presents an LLM-driven pipeline for synthesizing a customized database from a feature model (extended FODA/DBGraph) and reports direct empirical results on TPC-C via BenchmarkSQL. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear. The central claim (zero-error 60-min run at tpmC=130) is a measured outcome of the constructed artifact, not a reduction to prior definitions or fits. The reader's circularity score of 0.0 is confirmed; the work is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces an engineering pipeline rather than a mathematical model; the decomposition into ten modules and the addition of cooperate edges are author-chosen constructs without external validation or free parameters fitted to data.

axioms (1)
  • domain assumption The extended FODA model with cooperate edges is sufficient to capture all cross-module dependencies that affect correctness or performance in relational database implementations.
    Invoked when constructing DBGraph and when the pipeline assembles modules.

pith-pipeline@v0.9.1-grok · 5863 in / 1402 out tokens · 33837 ms · 2026-06-28T20:33:13.237759+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 5 canonical work pages · 2 internal anchors

  1. [1]

    https://www.postgresql.org/, 2026

    PostgreSQL: The world’s most advanced open source relational database. https://www.postgresql.org/, 2026

  2. [2]

    Accessed: 2025-12-28

    Oracle Corporation.MySQL 8.0 Reference Manual, 2025. Accessed: 2025-12-28

  3. [3]

    AutoAdmin project at Microsoft Research: Lessons learned.IEEE Data Engineering Bulletin, 34(3):12–19, 2011

    Nicolas Bruno, Surajit Chaudhuri, Arnd Christian König, Vivek Narasayya, Ravi Ramamurthy, and Manoj Syamala. AutoAdmin project at Microsoft Research: Lessons learned.IEEE Data Engineering Bulletin, 34(3):12–19, 2011

  4. [4]

    Gordon, and Bohan Zhang

    Dana Van Aken, Andrew Pavlo, Geoffrey J. Gordon, and Bohan Zhang. Automatic database management system tuning through large-scale machine learning. InProceedings of the 2017 ACM International Conference on Management of Data (SIGMOD), pages 1009–1024, 2017

  5. [5]

    Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang

    Andrew Pavlo, Gustavo Angulo, Joy Arulraj, Haibin Lin, Jiexi Lin, Lin Ma, Prashanth Menon, Todd C. Mowry, Matthew Perron, Ian Quah, Siddharth Santurkar, Anthony Tomasic, Skye Toor, Dana Van Aken, Ziqi Wang, Yingjun Wu, Ran Xian, and Tieying Zhang. Self-driving database management systems. In8th Biennial Conference on Innovative Data Systems Research (CIDR), 2017

  6. [6]

    Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland

    Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. The end of an architectural era (It’s Time for a Complete Rewrite). InProceedings of the 33rd International Conference on Very Large Data Bases (VLDB), pages 1150–1160, 2007

  7. [7]

    Sharpen the spec, cut the code: A case for generative file system with {SYSSPEC}

    Qingyuan Liu, Mo Zou, Hengbin Zhang, Dong Du, Yubin Xia, and Haibo Chen. Sharpen the spec, cut the code: A case for generative file system with {SYSSPEC}. In24th USENIX Conference on File and Storage Technologies (FAST 26), pages 291–311, 2026

  8. [8]

    Un- veiling inefficiencies in llm-generated code: Toward a comprehensive taxonomy

    Altaf Allah Abbassi, Leuson Da Silva, Amin Nikanjam, and Foutse Khomh. A taxonomy of inefficiencies in LLM-generated Python code. arXiv:2503.06327, 2025

  9. [9]

    How efficient is LLM-generated code? a rigorous & high-standard benchmark

    Ruizhong Qiu, Weiliang Will Zeng, James Ezick, Christopher Lott, and Hanghang Tong. How efficient is LLM-generated code? a rigorous & high-standard benchmark. InThe Thirteenth International Conference on Learning Representations (ICLR), 2025

  10. [10]

    Feature-oriented domain analysis (foda) feasibility study

    Kyo C Kang, Sholom G Cohen, James A Hess, William E Novak, and A Spencer Peterson. Feature-oriented domain analysis (foda) feasibility study. Technical report, 1990

  11. [11]

    Github copilot.https://github.com/features/copilot, 2026

  12. [12]

    Claude.https://www.anthropic.com/claude, 2026

  13. [13]

    Bespoke OLAP: Synthesizing workload- specific one-size-fits-one database engines.Proceedings of the VLDB Endowment, 14(1), 2026

    Johannes Wehrstein, Timo Eckmann, Matthias Jasny, and Carsten Binnig. Bespoke OLAP: Synthesizing workload- specific one-size-fits-one database engines.Proceedings of the VLDB Endowment, 14(1), 2026. arXiv:2603.02001

  14. [14]

    ChatDev: Communicative agents for software development

    Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. ChatDev: Communicative agents for software development. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024

  15. [15]

    MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

    Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, Lingfeng Xiao, Chenglin Wu, and Jürgen Schmidhuber. MetaGPT: Meta programming for a multi-agent collaborative framework. arXiv:2308.00352, 2023. 21 SPECDB : LLM-Generated Customized Databases ...

  16. [16]

    AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

    Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, Ahmed Hassan Awadallah, Ryen W. White, Doug Burger, and Chi Wang. AutoGen: Enabling next-gen LLM applications via multi-agent conversation. arXiv:2308.08155, 2023

  17. [17]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. InAdvances in Neural Information Processing Systems (NeurIPS), 2023

  18. [18]

    Teaching large language models to self-debug

    Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. InThe Twelfth International Conference on Learning Representations (ICLR), 2024

  19. [19]

    ".join(map(str, arr))] 10return

    Tal Ridnik, Dedy Kredo, and Itamar Friedman. Code generation with AlphaCodium: From prompt engineering to flow engineering. arXiv:2401.08500, 2024

  20. [20]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering. InAdvances in Neural Information Processing Systems (NeurIPS), 2024

  21. [21]

    BenchmarkSQL: Tpc-c benchmark tool.https://github.com/pgsql-io/benchmarksql, 2024

  22. [22]

    Tpc-c.http://www.tpc.org/tpcc/, 2026

  23. [23]

    Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J

    Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P.C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. H-Store: A high- performance, distributed main memory transaction processing system.Proceedings of the VLDB Endowment, 1(2):1496–1499, 2008. A Per-database Extract...