pith. the verified trust layer for science. sign in

arxiv: 1703.02475 · v1 · pith:PXK5PZJRnew · submitted 2017-03-07 · 💻 cs.DB

OrpheusDB: Bolt-on Versioning for Relational Databases

classification 💻 cs.DB
keywords datasetdatalyresplitorpheusdbsystemversioningversionsaverage
0
0 comments X p. Extension
Add this Pith Number to your LaTeX paper What is a Pith Number?
\usepackage{pith}
\pithnumber{PXK5PZJR}

Prints a linked pith:PXK5PZJR badge after your title and writes the identifier into PDF metadata. Compiles on arXiv with no extra files. Learn more

read the original abstract

Data science teams often collaboratively analyze datasets, generating dataset versions at each stage of iterative exploration and analysis. There is a pressing need for a system that can support dataset versioning, enabling such teams to efficiently store, track, and query across dataset versions. We introduce OrpheusDB, a dataset version control system that "bolts on" versioning capabilities to a traditional relational database system, thereby gaining the analytics capabilities of the database "for free". We develop and evaluate multiple data models for representing versioned data, as well as a light-weight partitioning scheme, LyreSplit, to further optimize the models for reduced query latencies. With LyreSplit, OrpheusDB is on average 1000x faster in finding effective (and better) partitionings than competing approaches, while also reducing the latency of version retrieval by up to 20x relative to schemes without partitioning. LyreSplit can be applied in an online fashion as new versions are added, alongside an intelligent migration scheme that reduces migration time by 10x on average.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.