OrpheusDB: Bolt-on Versioning for Relational Databases
Add this Pith Number to your LaTeX paper
What is a Pith Number?\usepackage{pith}
\pithnumber{PXK5PZJR}
Prints a linked pith:PXK5PZJR badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more
read the original abstract
Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that "bolts on" versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database "for free". We develop and evaluate multiple data models for representing versioned data, as well as a light-weight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average 1000x faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to 20x relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by 10x on average.
This paper has not been read by Pith yet.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.