DIRT: Database-Integrated Random Testing
Pith reviewed 2026-05-15 00:34 UTC · model grok-4.3
The pith
Integrating random testing directly into the DBMS reduces false positives and finds more bugs during active development.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DIRT integrates a random testing framework directly into the DBMS, allowing database developers to specify correctness properties through generation actions rather than relying on external testing experts. Because the testing infrastructure lives inside the engine, the random testing process can change whenever the database code changes, eliminating many sources of false positives by construction and producing bug reports that remain relevant to the current development state. Evaluation on Turso showed 23 unique confirmed bugs together with higher true-positive rates and more useful reports than external SQLancer variants.
What carries the argument
Generation actions: an abstraction that lets database developers encode correctness properties so random test inputs and oracles stay aligned with the current, possibly incomplete implementation.
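To make the idea concrete, here is a minimal sketch of what a developer-written generation action could look like. It is not the paper's API: the `GenerationAction` class, the `requires` feature tag, `run_session`, and the use of Python's built-in `sqlite3` as a stand-in engine are all assumptions made for illustration. Each action changes the engine and a plain-Python model in lockstep, and only actions whose required feature is marked as supported are ever generated.

```python
import random
import sqlite3
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple


# Hypothetical shape for a generation action (an assumption, not the paper's
# API): it names the engine feature it needs and updates both the engine and
# a plain-Python model of the expected table contents.
@dataclass(frozen=True)
class GenerationAction:
    name: str
    requires: str
    apply: Callable[[random.Random, sqlite3.Connection, List[int]], None]


def insert_row(rng: random.Random, conn: sqlite3.Connection, model: List[int]) -> None:
    value = rng.randint(-1000, 1000)
    conn.execute("INSERT INTO t(v) VALUES (?)", (value,))
    model.append(value)


def delete_below(rng: random.Random, conn: sqlite3.Connection, model: List[int]) -> None:
    bound = rng.randint(-1000, 1000)
    conn.execute("DELETE FROM t WHERE v < ?", (bound,))
    model[:] = [v for v in model if v >= bound]


ACTIONS = [
    GenerationAction("insert", "INSERT", insert_row),
    GenerationAction("delete", "DELETE", delete_below),
]


def engine_matches_model(conn: sqlite3.Connection, model: List[int]) -> bool:
    # Oracle: the table must hold exactly the values the model expects.
    rows = [row[0] for row in conn.execute("SELECT v FROM t")]
    return sorted(rows) == sorted(model)


def run_session(seed: int, supported: Set[str], steps: int = 50) -> Tuple[bool, List[str]]:
    """Run random actions, skipping any whose feature is not yet supported."""
    rng = random.Random(seed)
    enabled = [a for a in ACTIONS if a.requires in supported]
    conn = sqlite3.connect(":memory:")  # stand-in for the engine under test
    try:
        conn.execute("CREATE TABLE t(v INTEGER)")
        model: List[int] = []
        trace: List[str] = []
        for _ in range(steps):
            action = rng.choice(enabled)
            action.apply(rng, conn, model)
            trace.append(action.name)
            if not engine_matches_model(conn, model):
                return False, trace  # the trace is the bug report
        return True, trace
    finally:
        conn.close()
```

Calling `run_session(seed=2, supported={"INSERT"})` never emits a `DELETE`, which is how a sketch like this would avoid reports about features the engine does not implement yet.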
If this is right
- Testing remains effective even when many features are incomplete or changing.
- Bug reports become more actionable because generated cases are tailored to the current engine state.
- Random testing can be introduced earlier in the development cycle without the usual flood of noise.
- The same integrated approach applies to other actively developed database engines beyond the Turso case study.
Where Pith is reading between the lines
- The integration pattern could transfer to testing other complex, stateful systems such as distributed services or kernels where external oracles quickly become stale.
- It may reduce dependence on separate testing specialists by letting implementers maintain the correctness model themselves.
- Automated extraction or synthesis of generation actions from schemas or existing code could further lower the effort required.
Load-bearing premise
Database developers can correctly and comprehensively specify correctness properties using generation actions without missing key cases or introducing new errors in the testing setup.
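One way this premise can fail is a mistake in the developer's own model of expected behavior. The sketch below is an illustration only, again using Python's `sqlite3` as a stand-in engine; the function names are hypothetical. It shows a naive model that treats NULL as 0 and therefore disagrees with a correct engine, which would surface as a spurious bug report.

```python
import sqlite3
from collections import Counter
from typing import List, Optional

Row = Optional[int]


def engine_result(rows: List[Row], bound: int) -> List[Row]:
    """Apply DELETE ... WHERE v < bound in the stand-in engine and return what is left."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE t(v INTEGER)")
        conn.executemany("INSERT INTO t(v) VALUES (?)", [(v,) for v in rows])
        conn.execute("DELETE FROM t WHERE v < ?", (bound,))
        return [row[0] for row in conn.execute("SELECT v FROM t")]
    finally:
        conn.close()


def naive_model(rows: List[Row], bound: int) -> List[Row]:
    # Specification mistake: NULL is treated as 0, so NULL rows get "deleted".
    return [v for v in rows if not ((v or 0) < bound)]


def null_aware_model(rows: List[Row], bound: int) -> List[Row]:
    # In SQL, NULL < bound is unknown, so the row does not match and survives.
    return [v for v in rows if v is None or v >= bound]


def same_rows(a: List[Row], b: List[Row]) -> bool:
    # Compare as multisets, since row order is not part of the property.
    return Counter(a) == Counter(b)
```

With rows `[None, 3, 7]` and a bound of 5, the naive model disagrees with the engine while the NULL-aware model agrees, so the "bug" would be in the test setup rather than the engine.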
What would settle it
A head-to-head run on a mature, stable DBMS in which DIRT reports a false-positive rate equal to or higher than external tools such as SQLancer, or an implementation where generation actions miss known bugs that external oracles detect.
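The first condition can be written down as a small check. This is a sketch under stated assumptions: "false-positive rate" is read here as the share of filed reports that were not confirmed as real bugs, and the `TriageSummary` type and function name are hypothetical. Any counts passed in are the caller's own triage data; none are taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TriageSummary:
    tool: str
    reports: int    # distinct reports filed after deduplication
    confirmed: int  # reports confirmed as real bugs

    @property
    def true_positive_rate(self) -> float:
        if self.reports == 0:
            raise ValueError("no reports to triage")
        return self.confirmed / self.reports

    @property
    def false_positive_rate(self) -> float:
        return 1.0 - self.true_positive_rate


def counts_against_integration(integrated: TriageSummary, external: TriageSummary) -> bool:
    """True when the integrated tool's false-positive rate is equal to or higher than the external tool's."""
    return integrated.false_positive_rate >= external.false_positive_rate
```

For example, with placeholder counts of 30 confirmed out of 40 reports for the integrated tool and 10 out of 40 for the external one, the check returns `False`, meaning that run would not count against the claim.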
Original abstract
Database management systems (DBMSs) are notoriously complex, making them difficult to test effectively, especially during early development when many features are incomplete. Traditional testing tools like SQLancer and SQLsmith are highly effective for mature databases, but they struggle with high false positive rates and low actionability when applied to evolving systems. We present DIRT, a paradigm designed specifically for testing databases during development, which integrates a testing framework directly into the DBMS, enabling the random testing process to evolve in tandem with the system and reducing false positives by construction. We introduce generation actions, an abstraction for allowing database developers rather than testing experts to specify correctness properties. We evaluate DIRT on Turso, an actively developed SQLite-compatible OLTP engine, and show that it finds 23 unique, confirmed bugs, significantly outperforming off-the-shelf SQLancer variants in terms of true positive rate and usefulness of bug reports. Our results demonstrate that embedding testing infrastructure within the DBMS can dramatically improve its effectiveness and usability during development.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DIRT, a paradigm for integrating random testing infrastructure directly into a DBMS to support testing during early development. It defines generation actions as an abstraction allowing database developers to specify correctness properties, claims this reduces false positives by construction, and evaluates the approach on Turso (an actively developed SQLite-compatible OLTP engine), reporting 23 unique confirmed bugs and higher true-positive rates plus more actionable reports than off-the-shelf SQLancer variants.
Significance. If the empirical results and methodology hold, embedding testing within the DBMS could meaningfully improve bug detection effectiveness and usability for evolving systems where traditional external tools produce high false-positive rates. The concrete outcome of 23 confirmed bugs on Turso provides a practical demonstration that developer-specified generation actions can yield usable findings during active development.
major comments (2)
- [Evaluation section] The abstract states that DIRT finds 23 unique confirmed bugs on Turso and outperforms SQLancer variants in true-positive rate, yet supplies no information on experimental controls, bug confirmation criteria, quantitative metrics (e.g., number of queries or runs), or potential confounds such as differing coverage or oracle strength. This absence makes the central claim of improved effectiveness difficult to assess or reproduce.
- [Generation actions definition, early sections] The claim that embedding reduces false positives by construction rests on the assumption that developers can comprehensively and correctly encode correctness properties via generation actions without omitting key cases or introducing new errors. The manuscript provides no discussion, validation, or counter-example analysis of this assumption, even though it is load-bearing for the usability argument.
minor comments (1)
- [Abstract] The abstract would be clearer if it briefly indicated the scale of the Turso evaluation (e.g., number of test runs or total queries generated) to give readers an immediate sense of experimental effort.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below, indicating revisions where the manuscript will be updated.
- Referee: [Evaluation section] The abstract states that DIRT finds 23 unique confirmed bugs on Turso and outperforms SQLancer variants in true-positive rate, yet supplies no information on experimental controls, bug confirmation criteria, quantitative metrics (e.g., number of queries or runs), or potential confounds such as differing coverage or oracle strength. This absence makes the central claim of improved effectiveness difficult to assess or reproduce.
  Authors: We agree that the Evaluation section requires additional detail to support reproducibility and assessment of the claims. In the revised manuscript we expand this section with descriptions of the experimental controls (identical hardware and time budgets across tools), the bug confirmation process (reproduction by Turso developers plus independent verification), quantitative metrics (total queries generated and number of runs), and discussion of confounds including coverage measurement and oracle differences. These changes directly address the gaps noted.
  Revision: yes
- Referee: [Generation actions definition, early sections] The claim that embedding reduces false positives by construction rests on the assumption that developers can comprehensively and correctly encode correctness properties via generation actions without omitting key cases or introducing new errors. The manuscript provides no discussion, validation, or counter-example analysis of this assumption, even though it is load-bearing for the usability argument.
  Authors: The referee correctly identifies that the reduction in false positives by construction depends on developers specifying generation actions effectively. The original manuscript supplies concrete examples from Turso but lacks explicit analysis of the assumption. We have added a new subsection discussing design principles intended to limit specification errors, the iterative refinement possible during development, and a brief analysis of cases where initial actions were incomplete. We maintain the core claim relative to external tools but acknowledge this is an assumption requiring ongoing validation.
  Revision: partial
Circularity Check
No significant circularity identified
The paper describes an empirical system integration and evaluation on Turso, reporting 23 confirmed bugs and improved true-positive rates versus external SQLancer variants. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the provided material that would reduce any claimed result to its inputs by construction. The central claims rest on concrete bug reports and comparative metrics rather than internal reductions or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: It is feasible to integrate a testing framework directly into the DBMS without prohibitive performance or complexity costs during active development.
invented entities (1)
- generation actions: no independent evidence
Paper running header: Conference'17, July 2017, Washington, DC, USA; Alperen Keles, Ethan Chou, Harrison Goldstein, and Leonidas Lampropoulos