arxiv: 1807.03078 · v2 · pith:2H5VL6XG · submitted 2018-07-09 · astro-ph.IM · astro-ph.CO

Analyzing billion-objects catalog interactively: Apache Spark for physicists

S. Plaszczynski, J. Peloton, C. Arnault, J.E. Campagne

keywords: order, spark, analyses, apache, data, datasets, design, interactively

Apache Spark is a Big Data framework for working on large distributed datasets. Although widely used in industry, its use remains rather limited in the academic community, or is often restricted to software engineers. The goal of this paper is to show with practical use cases that the technology is mature enough to be used without excessive programming skills by astronomers or cosmologists to perform standard analyses over large datasets, such as those originating from future galaxy surveys. To demonstrate this, we start from a realistic simulation corresponding to 10 years of LSST data taking (6 billion galaxies). We then design, optimize and benchmark a set of Spark Python algorithms to perform standard operations such as adding photometric redshift errors, measuring the selection function, or computing power spectra over tomographic bins. Most of the commands execute on the full 110 GB dataset within tens of seconds and can therefore be performed interactively to design full-scale cosmological analyses. A Jupyter notebook summarizing the analysis is available at https://github.com/astrolabsoftware/1807.03078.
