Probabilistic Data Analysis with Probabilistic Programming

Feras Saad; Vikash Mansinghka

arxiv: 1608.05347 · v1 · pith:YZOHV6XUnew · submitted 2016-08-18 · 💻 cs.AI · cs.LG · stat.ML

classification 💻 cs.AI · cs.LG · stat.ML

keywords probabilistic · analysis · data · cgpms · models · code · language · lines

Probabilistic techniques are central to data analysis, but different approaches can be difficult to apply, combine, and compare. This paper introduces composable generative population models (CGPMs), a computational abstraction that extends directed graphical models and can be used to describe and compose a broad class of probabilistic data analysis techniques. Examples include hierarchical Bayesian models, multivariate kernel methods, discriminative machine learning, clustering algorithms, dimensionality reduction, and arbitrary probabilistic programs. We also demonstrate the integration of CGPMs into BayesDB, a probabilistic programming platform that can express data analysis tasks using a modeling language and a structured query language. The practical value is illustrated in two ways. First, CGPMs are used in an analysis that identifies satellite data records which probably violate Kepler's Third Law, by composing causal probabilistic programs with non-parametric Bayes in under 50 lines of probabilistic code. Second, for several representative data analysis tasks, we report on lines of code and accuracy measurements of various CGPMs, plus comparisons with standard baseline solutions from Python and MATLAB libraries.
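
For readers unfamiliar with the constraint named in the abstract, Kepler's Third Law ties an orbit's period T to its semi-major axis a through T = 2π·sqrt(a³/GM). The sketch below is not the paper's method, which composes probabilistic programs with non-parametric Bayes; it is only a deterministic illustration of what "violating" the law can mean for a data record. The record layout, the satellite names, the sample values, and the 5% tolerance are all hypothetical.

```python
import math

# Standard gravitational parameter of Earth (GM), in km^3/s^2.
MU_EARTH_KM3_S2 = 398600.4418


def kepler_period_minutes(semi_major_axis_km):
    """Orbital period predicted by Kepler's Third Law, in minutes."""
    seconds = 2 * math.pi * math.sqrt(semi_major_axis_km ** 3 / MU_EARTH_KM3_S2)
    return seconds / 60.0


def flag_violations(records, rel_tol=0.05):
    """Return names of records whose reported period differs from the
    Keplerian prediction by more than rel_tol (a hypothetical threshold)."""
    flagged = []
    for name, semi_major_axis_km, period_min in records:
        predicted = kepler_period_minutes(semi_major_axis_km)
        if abs(period_min - predicted) / predicted > rel_tol:
            flagged.append(name)
    return flagged


# Hypothetical records: (name, semi-major axis in km, reported period in minutes).
records = [
    ("geo_like", 42164.0, 1436.0),
    ("leo_like", 6778.0, 92.6),
    ("suspect", 42164.0, 100.0),  # period inconsistent with its orbit size
]

print(flag_violations(records))
```

A hard threshold like this gives a yes/no answer, whereas the abstract speaks of records that "probably" violate the law, which is the kind of graded judgment a probabilistic model is meant to supply.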

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bayesian Synthesis of Probabilistic Programs for Automatic Data Modeling
cs.PL · 2019-07 · unverdicted · novelty 6.0

Bayesian synthesis formulates automatic construction of probabilistic programs in PCFG-specified DSLs with soundness conditions, enabling structure analysis and prediction that outperforms baselines on real datasets.