{"paper":{"title":"k-Means for Streaming and Distributed Big Sparse Data","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"","cross_cats":[],"primary_cat":"cs.DS","authors_text":"Artem Barger, Dan Feldman","submitted_at":"2015-11-29T10:06:11Z","abstract_excerpt":"We provide the first streaming algorithm for computing a provable approximation to the $k$-means of sparse Big data. Here, sparse Big Data is a set of $n$ vectors in $\\mathbb{R}^d$, where each vector has $O(1)$ non-zeroes entries, and $d\\geq n$. E.g., adjacency matrix of a graph, web-links, social network, document-terms, or image-features matrices.\n  Our streaming algorithm stores at most $\\log n\\cdot k^{O(1)}$ input points in memory. If the stream is distributed among $M$ machines, the running time reduces by a factor of $M$, while communicating a total of $M\\cdot k^{O(1)}$ (sparse) input po"},"claims":{"count":0,"items":[],"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"source":{"id":"1511.08990","kind":"arxiv","version":2},"verdict":{"id":null,"model_set":{},"created_at":null,"strongest_claim":"","one_line_summary":"","pipeline_version":null,"weakest_assumption":"","pith_extraction_headline":""},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}