Optimizations to Petastorm and Parquet data pipelines with caching and deterministic queues reduce large-scale deep learning training time by 6x while raising GPU utilization above 60% and eliminating run-to-run variance.
Apache Parquet,
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.DC 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale
Optimizations to Petastorm and Parquet data pipelines with caching and deterministic queues reduce large-scale deep learning training time by 6x while raising GPU utilization above 60% and eliminating run-to-run variance.