arxiv: 2604.16373 · v1 · submitted 2026-03-23 · 💻 cs.DB · cs.SE

DIRT: Database-Integrated Random Testing

Pith reviewed 2026-05-15 00:34 UTC · model grok-4.3

classification 💻 cs.DB cs.SE
keywords database testing · random testing · integrated testing · generation actions · bug detection · DBMS development · false positives · property-based testing

The pith

Integrating random testing directly into the DBMS reduces false positives and finds more bugs during active development.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Database systems are difficult to test while features are still incomplete because external tools produce many invalid inputs and false alarms. DIRT moves the random tester inside the database engine so testing code evolves in lockstep with the implementation. Developers define generation actions that encode correctness properties for the current state of the system, which by construction keeps generated tests valid and actionable. On the Turso SQLite-compatible engine this method located 23 unique confirmed bugs and delivered markedly higher true-positive rates than off-the-shelf SQLancer variants. The central insight is that embedding the tester removes the mismatch between test generator and evolving target that causes most false positives.

Core claim

DIRT integrates a random testing framework directly into the DBMS, allowing database developers to specify correctness properties through generation actions rather than relying on external testing experts. Because the testing infrastructure lives inside the engine, the random testing process can change whenever the database code changes, eliminating many sources of false positives by construction and producing bug reports that remain relevant to the current development state. Evaluation on Turso showed 23 unique confirmed bugs together with higher true-positive rates and more useful reports than external SQLancer variants.
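
One way to picture "eliminating false positives by construction" is a generator that only emits statements for features the engine currently declares as implemented. The sketch below is an illustration under stated assumptions, not the paper's mechanism: the `SUPPORTED` flag table, the `GENERATORS` map, and `generate_workload` are all hypothetical names, and the SQL strings are placeholders.

```python
import random

# Hypothetical feature flags; in an integrated setup the engine would own these.
SUPPORTED = {"select": True, "insert": True, "left_join": False, "window_fn": False}

# One tiny statement generator per feature (illustrative SQL only).
GENERATORS = {
    "select": lambda rng: f"SELECT x FROM t WHERE x > {rng.randint(0, 9)}",
    "insert": lambda rng: f"INSERT INTO t(x) VALUES ({rng.randint(0, 9)})",
    "left_join": lambda rng: "SELECT t.x FROM t LEFT JOIN u ON t.x = u.x",
    "window_fn": lambda rng: "SELECT x, row_number() OVER (ORDER BY x) FROM t",
}


def generate_workload(rng, supported, n):
    """Return n (feature, sql) pairs drawn only from currently enabled features."""
    enabled = sorted(name for name, on in supported.items() if on)
    workload = []
    for _ in range(n):
        feature = rng.choice(enabled)
        workload.append((feature, GENERATORS[feature](rng)))
    return workload
```

Flipping a flag such as `left_join` to `True` in the same change that lands the feature would make the generator start exercising it. An external tool has no such signal and may keep producing statements the engine cannot yet handle.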

What carries the argument

Generation actions: an abstraction that lets database developers encode correctness properties so random test inputs and oracles stay aligned with the current, possibly incomplete implementation.
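
To make the abstraction concrete, here is a minimal sketch of what a generation action might look like. It is not the paper's API: `GenerationAction`, `insert_then_select`, `run_action`, and `run_trials` are hypothetical names, and Python's standard `sqlite3` module stands in for the engine under test. The point shown is that a single unit both generates the statements and carries the check that must hold once they have run.

```python
import random
import sqlite3
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class GenerationAction:
    """Hypothetical stand-in: statements to run plus a check on the last result."""
    statements: List[str]
    check: Callable[[list], bool]
    description: str


def insert_then_select(rng: random.Random, table: str = "t") -> GenerationAction:
    """Property: a row that was just inserted must be returned by a matching SELECT."""
    value = rng.randint(-1000, 1000)
    return GenerationAction(
        statements=[
            f"INSERT INTO {table}(x) VALUES ({value})",
            f"SELECT x FROM {table} WHERE x = {value}",
        ],
        check=lambda rows: (value,) in rows,
        description=f"inserted value {value} must be selectable",
    )


def run_action(conn: sqlite3.Connection, action: GenerationAction) -> bool:
    """Execute every statement in order and apply the check to the final result set."""
    rows = []
    for stmt in action.statements:
        rows = conn.execute(stmt).fetchall()
    return action.check(rows)


def run_trials(seed: int = 0, trials: int = 100) -> int:
    """Run many random actions against an in-memory database; return the failure count."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t(x INTEGER)")
    failures = sum(0 if run_action(conn, insert_then_select(rng)) else 1
                   for _ in range(trials))
    conn.close()
    return failures
```

Because the check is built from the same random value as the statements, the oracle cannot drift away from the input it judges. That is the alignment the paragraph above describes.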

If this is right

  • Testing remains effective even when many features are incomplete or changing.
  • Bug reports become more actionable because generated cases are tailored to the current engine state.
  • Random testing can be introduced earlier in the development cycle without the usual flood of noise.
  • The same integrated approach applies to other actively developed database engines beyond the Turso case study.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The integration pattern could transfer to testing other complex, stateful systems such as distributed services or kernels where external oracles quickly become stale.
  • It may reduce dependence on separate testing specialists by letting implementers maintain the correctness model themselves.
  • Automated extraction or synthesis of generation actions from schemas or existing code could further lower the effort required.

Load-bearing premise

Database developers can correctly and comprehensively specify correctness properties using generation actions without missing key cases or introducing new errors in the testing setup.

What would settle it

A head-to-head run on a mature, stable DBMS in which DIRT reports a false-positive rate equal to or higher than external tools such as SQLancer, or an implementation where generation actions miss known bugs that external oracles detect.
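
The comparison proposed above comes down to simple arithmetic over classified run outcomes. The sketch below assumes each run is labelled `"true_positive"`, `"false_positive"`, or `"no_bug"`, mirroring the categories named in the Figure 3 caption. The function name `report_rates` and the choice to compute rates over reported runs only are assumptions, since the paper's exact metric definitions are not given here.

```python
from collections import Counter


def report_rates(outcomes):
    """Compute true- and false-positive rates among runs that reported a bug."""
    counts = Counter(outcomes)
    tp = counts["true_positive"]
    fp = counts["false_positive"]
    reports = tp + fp  # runs labelled "no_bug" produce no report
    if reports == 0:
        return {"reports": 0, "true_positive_rate": None, "false_positive_rate": None}
    return {
        "reports": reports,
        "true_positive_rate": tp / reports,
        "false_positive_rate": fp / reports,
    }
```

Applying the same function to each tool's outcomes on the same engine build gives directly comparable numbers. The settling experiment would come out against DIRT if its `false_positive_rate` were equal to or higher than the external tool's.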

Figures

Figures reproduced from arXiv: 2604.16373 by Alperen Keles, Ethan Chou, Harrison Goldstein, Leonidas Lampropoulos.

Figure 1: Pivoted Query Synthesis as a universally quantified property (left) and as a generation action (right).
Figure 2: Definitions of Oracles as Generation Actions
Figure 3: Comparison of True Positives, False Positives and No Bugs for
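
The Figure 1 caption refers to Pivoted Query Synthesis (PQS), which picks a random "pivot" row, builds a query whose predicate is true for that row, and checks that the row comes back. The sketch below is a simplified illustration rather than the paper's code. Real PQS generates random expressions and rectifies them to evaluate to TRUE on the pivot, whereas this version picks from a few fixed templates. All function names are hypothetical, and `sqlite3` stands in for the engine under test.

```python
import random
import sqlite3


def make_table(conn, rng, n_rows=20):
    """Create and fill a small two-column table with random integers."""
    conn.execute("CREATE TABLE t(a INTEGER, b INTEGER)")
    rows = [(rng.randint(0, 9), rng.randint(0, 9)) for _ in range(n_rows)]
    conn.executemany("INSERT INTO t(a, b) VALUES (?, ?)", rows)


def predicate_true_for(pivot, rng):
    """Pick a WHERE clause that holds for the pivot row by construction."""
    a, b = pivot
    templates = [
        f"a = {a}",
        f"b <= {b}",
        f"a + b = {a + b}",
        f"NOT (a <> {a} AND b <> {b})",
    ]
    return rng.choice(templates)


def pivot_is_returned(conn, rng):
    """One PQS-style check: the pivot row must appear in the filtered result."""
    rows = conn.execute("SELECT a, b FROM t").fetchall()
    pivot = rng.choice(rows)
    where = predicate_true_for(pivot, rng)
    result = conn.execute(f"SELECT a, b FROM t WHERE {where}").fetchall()
    return pivot in result


def run_pqs(seed=0, trials=200):
    """Return how many checks failed to return their pivot row."""
    rng = random.Random(seed)
    conn = sqlite3.connect(":memory:")
    make_table(conn, rng)
    misses = sum(0 if pivot_is_returned(conn, rng) else 1 for _ in range(trials))
    conn.close()
    return misses
```

A nonzero return value from `run_pqs` would mean the engine dropped a row that its own predicate says must match, which is the kind of logic bug this style of oracle targets.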
Original abstract

Database management systems (DBMSs) are notoriously complex, making them difficult to test effectively, especially during early development when many features are incomplete. Traditional testing tools like SQLancer and SQLSmith are highly effective for mature databases, but they struggle with high false positive rates and low actionability when applied to evolving systems. We present DIRT, a paradigm designed specifically for testing databases during development, which integrates a testing framework directly into the DBMS, enabling the random testing process to evolve in tandem with the system and reducing false positives by construction. We introduce generation actions, an abstraction for allowing database developers rather than testing experts to specify correctness properties. We evaluate DIRT on Turso, an actively developed SQLite-compatible OLTP engine, and show that it finds 23 unique, confirmed bugs--significantly outperforming off-the-shelf SQLancer variants in terms of true positive rate and usefulness of bug reports. Our results demonstrate that embedding testing infrastructure within the DBMS can dramatically improve its effectiveness and usability during development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces DIRT, a paradigm for integrating random testing infrastructure directly into a DBMS to support testing during early development. It defines generation actions as an abstraction allowing database developers to specify correctness properties, claims this reduces false positives by construction, and evaluates the approach on Turso (an actively developed SQLite-compatible OLTP engine), reporting 23 unique confirmed bugs and higher true-positive rates plus more actionable reports than off-the-shelf SQLancer variants.

Significance. If the empirical results and methodology hold, embedding testing within the DBMS could meaningfully improve bug detection effectiveness and usability for evolving systems where traditional external tools produce high false-positive rates. The concrete outcome of 23 confirmed bugs on Turso provides a practical demonstration that developer-specified generation actions can yield usable findings during active development.

major comments (2)
  1. [Evaluation section] Evaluation section: the abstract states that DIRT finds 23 unique confirmed bugs on Turso and outperforms SQLancer variants in true-positive rate, yet supplies no information on experimental controls, bug confirmation criteria, quantitative metrics (e.g., number of queries or runs), or potential confounds such as differing coverage or oracle strength. This absence makes the central claim of improved effectiveness difficult to assess or reproduce.
  2. [Generation actions definition] Generation actions definition (early sections): the claim that embedding reduces false positives by construction rests on the assumption that developers can comprehensively and correctly encode correctness properties via generation actions without omitting key cases or introducing new errors; the manuscript provides no discussion, validation, or counter-example analysis of this assumption despite it being load-bearing for the usability argument.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly indicated the scale of the Turso evaluation (e.g., number of test runs or total queries generated) to give readers an immediate sense of experimental effort.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below, indicating revisions where the manuscript will be updated.

  1. Referee: [Evaluation section] Evaluation section: the abstract states that DIRT finds 23 unique confirmed bugs on Turso and outperforms SQLancer variants in true-positive rate, yet supplies no information on experimental controls, bug confirmation criteria, quantitative metrics (e.g., number of queries or runs), or potential confounds such as differing coverage or oracle strength. This absence makes the central claim of improved effectiveness difficult to assess or reproduce.

    Authors: We agree that the Evaluation section requires additional detail to support reproducibility and assessment of the claims. In the revised manuscript we expand this section with descriptions of the experimental controls (identical hardware and time budgets across tools), the bug confirmation process (reproduction by Turso developers plus independent verification), quantitative metrics (total queries generated and number of runs), and discussion of confounds including coverage measurement and oracle differences. These changes directly address the gaps noted.
    revision: yes

  2. Referee: [Generation actions definition] Generation actions definition (early sections): the claim that embedding reduces false positives by construction rests on the assumption that developers can comprehensively and correctly encode correctness properties via generation actions without omitting key cases or introducing new errors; the manuscript provides no discussion, validation, or counter-example analysis of this assumption despite it being load-bearing for the usability argument.

    Authors: The referee correctly identifies that the reduction in false positives by construction depends on developers specifying generation actions effectively. The original manuscript supplies concrete examples from Turso but lacks explicit analysis of the assumption. We have added a new subsection discussing design principles intended to limit specification errors, the iterative refinement possible during development, and a brief analysis of cases where initial actions were incomplete. We maintain the core claim relative to external tools but acknowledge this is an assumption requiring ongoing validation.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

The paper describes an empirical system integration and evaluation on Turso, reporting 23 confirmed bugs and improved true-positive rates versus external SQLancer variants. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided material that would reduce any claimed result to its inputs by construction. The central claims rest on concrete bug reports and comparative metrics rather than internal reductions or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the feasibility of modifying a DBMS to host testing infrastructure and on developers' ability to use the new generation actions abstraction effectively. No numerical free parameters are described. One domain assumption and one invented entity are identified.

axioms (1)
  • domain assumption: It is feasible to integrate a testing framework directly into the DBMS without prohibitive performance or complexity costs during active development.
    Required for the paradigm to function as described in the abstract.
invented entities (1)
  • generation actions (no independent evidence)
    purpose: Abstraction allowing database developers, rather than testing experts, to specify correctness properties
    New concept introduced to enable the integrated testing approach.

