Hillview: A trillion-cell spreadsheet for big data
Pith reviewed 2026-05-24 23:30 UTC · model grok-4.3
The pith
Hillview lets users interactively explore spreadsheets with trillions of cells on just eight servers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Hillview shows that visualization sketches called vizketches can scale spreadsheet interactivity to tens of billions of rows and trillions of cells by parallelizing computation across servers, reducing communication, supporting progressive rendering, and providing precise accuracy guarantees.
What carries the argument
Vizketches: compact visualizations that combine algorithmic data summarization with computer graphics rendering principles to enable low-latency, accurate displays.
If this is right
- Users can switch between many visualizations without reloading data.
- Exploration remains feasible on datasets far larger than main memory.
- Accuracy guarantees let analysts trust the displayed summaries for decisions.
- Computation parallelizes across a modest number of servers.
- Progressive rendering gives immediate feedback while full precision arrives.
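The progressive-rendering consequence can be illustrated with a minimal Python sketch. It assumes the data is already split into in-memory partitions and that the summary is a plain per-category count; the names `summarize` and `progressive_counts` are hypothetical and are not Hillview's API. Each finished partition is merged into the running result, which a front end could redraw right away.

```python
from collections import Counter
from typing import Iterable, Iterator, Sequence, Tuple


def summarize(partition: Iterable[str]) -> Counter:
    """Count how often each category occurs in one partition."""
    return Counter(partition)


def progressive_counts(
    partitions: Sequence[Sequence[str]],
) -> Iterator[Tuple[float, Counter]]:
    """Yield (fraction_done, merged_counts) after each partition finishes."""
    merged: Counter = Counter()
    total = len(partitions)
    for done, partition in enumerate(partitions, start=1):
        # Merge step: adding two Counters adds their per-key counts.
        merged = merged + summarize(partition)
        yield done / total, merged


# Toy data standing in for three server-side partitions (an assumption).
partitions = [["a", "b", "a"], ["b", "b"], ["c", "a"]]
updates = list(progressive_counts(partitions))
for fraction, counts in updates:
    print(f"{fraction:.0%} done: {dict(counts)}")
```

The loop here is sequential for clarity; in a distributed setting the partitions would finish in arbitrary order, which is harmless as long as the merge is commutative and associative, as count addition is.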
Where Pith is reading between the lines
- The same sketching approach might support live updates from streaming sources if incremental maintenance is added.
- Similar compact summaries could improve interactivity in other visual analytics tools such as geographic maps or network diagrams.
- Accuracy bounds might allow automatic query optimization by choosing sketch granularity based on display resolution.
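One way to read the last point is sketched below, purely as an illustration and not as the paper's method. It assumes a histogram whose y-axis maps row fractions 0..1 onto the full chart height, picks a bin count from the chart width, and uses Hoeffding's inequality with a union bound over bins to choose a sample size that keeps every bar within half a pixel with probability at least 1 - delta. The function names and the default values are hypothetical.

```python
import math


def bins_for_width(width_px: int, min_bar_px: int = 4, max_bins: int = 200) -> int:
    """Pick a bin count so that each bar is at least min_bar_px wide."""
    return max(1, min(max_bins, width_px // min_bar_px))


def sample_size(height_px: int, bins: int, delta: float = 0.01) -> int:
    """Rows to sample so every bin fraction is within half a pixel.

    Hoeffding: P(|p_hat - p| >= eps) <= 2 * exp(-2 * n * eps**2).
    A union bound over all bins replaces delta with delta / bins.
    """
    eps = 1.0 / (2 * height_px)  # half a pixel as a fraction of chart height
    return math.ceil(math.log(2 * bins / delta) / (2 * eps * eps))


bins = bins_for_width(800)
print(bins, sample_size(400, bins))
```

The required sample size depends on the display resolution and the failure probability, not on the number of rows in the dataset, which is what would make resolution-driven planning attractive. If the y-axis is instead scaled to the tallest bar, the same sample gives a looser pixel bound.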
Load-bearing premise
Vizketches can be computed and rendered with low enough latency and communication cost to preserve spreadsheet-style interactivity on arbitrary real-world data.
What would settle it
Measure end-to-end latency for a sequence of arbitrary user queries on a trillion-cell dataset and check whether response times stay under a few seconds while the published accuracy guarantees hold.
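A measurement harness for this test could look like the following sketch. `run_query` is a hypothetical stand-in for whatever client call submits a query and blocks until the final result arrives, and the 2-second budget is an assumed reading of "a few seconds"; neither comes from the paper.

```python
import math
import statistics
import time
from typing import Callable, Dict, Iterable, List


def measure_latencies(run_query: Callable[[str], object],
                      queries: Iterable[str]) -> List[float]:
    """Run each query once, in order, and record wall-clock seconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)  # hypothetical client call; blocks until done
        latencies.append(time.perf_counter() - start)
    return latencies


def latency_report(latencies: List[float], budget_s: float = 2.0) -> Dict[str, float]:
    """Summarize a non-empty list of latencies against an assumed budget."""
    ordered = sorted(latencies)
    # Nearest-rank 95th percentile.
    p95 = ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]
    return {
        "median_s": statistics.median(ordered),
        "p95_s": p95,
        "max_s": ordered[-1],
        "within_budget": sum(t <= budget_s for t in ordered) / len(ordered),
    }


def fake_query(query: str) -> None:
    """Placeholder workload so the sketch runs without a cluster."""
    time.sleep(0.001)


report = latency_report(measure_latencies(fake_query, ["histogram", "sort", "filter"]))
print(report)
```

Reporting the tail (95th percentile and maximum) alongside the median matters here, because interactivity is lost on the slowest queries rather than on the typical one.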
Original abstract
Hillview is a distributed spreadsheet for browsing very large datasets that cannot be handled by a single machine. As a spreadsheet, Hillview provides a high degree of interactivity that permits data analysts to explore information quickly along many dimensions while switching visualizations on a whim. To provide the required responsiveness, Hillview introduces visualization sketches, or vizketches, as a simple idea to produce compact data visualizations. Vizketches combine algorithmic techniques for data summarization with computer graphics principles for efficient rendering. While simple, vizketches are effective at scaling the spreadsheet by parallelizing computation, reducing communication, providing progressive visualizations, and offering precise accuracy guarantees. Using Hillview running on eight servers, we can navigate and visualize datasets of tens of billions of rows and trillions of cells, much beyond the published capabilities of competing systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. Hillview is a distributed spreadsheet system for interactive exploration of datasets too large for a single machine. It introduces vizketches—compact visualization sketches that combine data summarization algorithms with graphics rendering principles—to enable parallel computation, reduced communication, progressive rendering, and accuracy guarantees while preserving spreadsheet-style interactivity. The central empirical claim is that the system, running on eight servers, supports navigation and visualization of tens of billions of rows and trillions of cells, exceeding published capabilities of competing systems.
Significance. If the reported scaling and latency results hold under the stated conditions, the work provides a concrete demonstration that spreadsheet interactivity can be extended to trillion-cell scales via targeted summarization techniques. This has potential impact on big-data analytics tools by showing how algorithmic sketches can be integrated with rendering to maintain responsiveness without sacrificing accuracy guarantees. The emphasis on progressive visualizations and precise error bounds is a constructive contribution to distributed systems for data exploration.
major comments (1)
- [abstract] The central scaling claim (abstract) rests on empirical measurements of vizketches under real-world query workloads, yet the provided text supplies no experimental section, baselines, hardware details, or error-bar information. Without these, the load-bearing assumption that vizketches deliver low-latency interactivity on arbitrary data distributions cannot be evaluated.
Simulated Author's Rebuttal
We thank the referee for the detailed review and the opportunity to clarify aspects of our work. We address the single major comment below, pointing to the relevant sections of the full manuscript.
-
Referee: [abstract] The central scaling claim (abstract) rests on empirical measurements of vizketches under real-world query workloads, yet the provided text supplies no experimental section, baselines, hardware details, or error-bar information. Without these, the load-bearing assumption that vizketches deliver low-latency interactivity on arbitrary data distributions cannot be evaluated.
Authors: The full manuscript includes Section 6 (Evaluation), which provides the requested details: hardware specifications for the eight-server cluster, descriptions of real-world workloads and datasets (including navigation and visualization tasks on tens of billions of rows), direct baselines against competing systems such as Spark-based tools and other distributed visualization frameworks, measured latencies, and accuracy guarantees with error bounds for the vizketches. These experiments support the abstract's scaling claims under the tested conditions. The abstract is a concise summary and does not duplicate the full experimental methodology or results, which appear in the body of the paper.
Revision: no
Circularity Check
No significant circularity
This is a systems paper describing an implementation of a distributed spreadsheet (Hillview) and its vizketches mechanism. The abstract and provided text contain no equations, derivations, fitted parameters, or load-bearing self-citations that reduce a claimed result to its own inputs by construction. Performance claims rest on empirical measurements rather than any self-referential mathematical chain. No instances of the enumerated circularity patterns are present.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Vizketches combine algorithmic techniques for data summarization with computer graphics principles for efficient rendering... compute only what you can display.
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
The summarize function outputs a vector of B bin counts, and the merge function adds two vectors.
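The quoted passage describes a histogram summary, which can be sketched in a few lines of Python. This is a minimal illustration under assumptions: the range `[lo, hi]` and the bin count are fixed in advance, values outside the range are ignored, and the function signatures are hypothetical rather than Hillview's actual interface.

```python
from typing import List, Sequence


def summarize(values: Sequence[float], lo: float, hi: float, bins: int) -> List[int]:
    """Return a vector of `bins` counts for the values that fall in [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi lands in the last bin.
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
    return counts


def merge(a: Sequence[int], b: Sequence[int]) -> List[int]:
    """Merge two summaries by adding the count vectors element-wise."""
    if len(a) != len(b):
        raise ValueError("summaries must have the same number of bins")
    return [x + y for x, y in zip(a, b)]


part1 = [0.5, 1.5, 9.9]
part2 = [2.5, 10.0, 0.1]
merged = merge(summarize(part1, 0.0, 10.0, 5), summarize(part2, 0.0, 10.0, 5))
print(merged)  # [3, 1, 0, 0, 2]
```

Because element-wise addition is associative and commutative, merging the per-partition summaries in any order gives the same vector as summarizing the combined data, and the result has a fixed size of B counts regardless of how many rows were scanned.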
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
IMPLEMENTATION Hillview consists of 35000 lines of Java and 16000 lines of TypeScript code. The user interface in the browser is implemented in TypeScript [95], using parts of the D3 JavaScript library [11]. Graphics is done using SVG [25]. The web server runs the Apache Tomcat application server [4]. The browser gets progressive replies from web server u...
-
[2]
EVALUATION Our evaluation goal is to determine whether Hillview provides interactive performance with large data sets, how Hillview compares to existing systems, how vizketches contribute to that goal, and how effective the spreadsheet is. Summary. We find the following results: • Hillview can handle spreadsheets with 130B rows and 1.4T cells using only...
-
[3]
overview first, zoom and filter, details on demand
RELATED WORK Hillview is the first spreadsheet to scale massively with interactive speed. Hillview borrows ideas from the algorithms and computer graphics literature, namely mergeable summaries [2] (or sketches) and visualization-driven computation; it relies on many techniques from databases (approximate query processing, on-line analytics), big-da...
-
[4]
CONCLUSION Hillview is a spreadsheet that supports a trillion cells even with a modest number of servers. Hillview introduces a new query execution engine specialized to render tabular views and charts for a spreadsheet. The new engine uses vizketches, a new but simple idea that parallelizes computation and calculates only what is needed for a good visu...
-
[5]
L. Abraham, J. Allen, O. Barykin, V. R. Borkar, B. Chopra, C. Gerea, D. Merl, J. Metzler, D. Reiss, S. Subramanian, J. L. Wiener, and O. Zed. Scuba: Diving into data at Facebook. PVLDB, 6(11):1057–1067, 2013
-
[6]
P. K. Agarwal, G. Cormode, Z. Huang, J. Phillips, Z. Wei, and K. Yi. Mergeable summaries. In ACM SIGMOD International conference on Management of data, pages 23–34, 2012
-
[7]
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: Queries with bounded errors and bounded response times on very large data. In European Conference on Computer Systems (EuroSys), Prague, Czech Republic, 2013
-
[8]
Apache Tomcat. http://tomcat.apache.org. Retrieved March 2019
-
[9]
M. Barnett, B. Chandramouli, R. DeLine, S. Drucker, D. Fisher, J. Goldstein, P. Morrison, and J. Platt. Stat!: an interactive analytics environment for big data. In ACM SIGMOD International conference on Management of data, pages 1013–1016, 2013
- [10]
- [11]
-
[12]
M. Behrisch, D. Streeb, F. Stoffel, D. Seebacher, B. Matejek, S. H. Weber, S. Mittelstaedt, H. Pfister, and D. Keim. Commercial visual analytics systems – advances in the big data analytics field. IEEE Transactions on Visualization and Computer Graphics, 2018
-
[13]
N. Bikakis. Big data visualization tools. In S. Sakr and A. Zomaya, editors, Encyclopedia of Big Data Technologies, pages 1–6. Springer International Publishing, Cham, 2018
-
[14]
N. Bikakis, G. Papastefanatos, M. Skourla, and T. Sellis. A hierarchical aggregation framework for efficient multilevel visual exploration and analysis. Semantic Web, 8(1):139–179, 2017
-
[15]
M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-driven documents. IEEE Trans. Visualization and Comp. Graphics (Proc. InfoVis), 2011
-
[16]
M. Brown. BigSheets for the common man. https://www.ibm.com/developerworks/library/bd-bigsheets/index.html, December 2013
- [17]
- [18]
-
[19]
S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In ACM SIGMOD International conference on Management of data, pages 295–306, 2001
-
[20]
J. Choo, C. Lee, H. Kim, H. Lee, C. Reddy, B. Drake, and H. Park. PIVE: Per-iteration visualization environment for supporting real-time interactions with computational methods. In Visual Analytics Science and Technology (VAST), 2014
-
[21]
R. Christopher and V. Krishnan. Optimizing your Amazon Redshift and Tableau software deployment for better performance v2. https://www.tableau.com/sites/default/files/ whitepapers/optimizing tableau aws redshift whitepaper v2.pdf, 2017
-
[22]
L. Chu, H. Tang, T. Yang, and K. Shen. Optimizing data aggregation for cluster-based Internet services. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 119–130, 2003
-
[23]
E. Cohen and H. Kaplan. Summarizing data using bottom-k sketches. In ACM Symposium on Principles of Distributed Computing (PODC), pages 225–234, New York, NY, USA,
- [24]
-
[25]
G. Cormode. Data sketching. Communications of the ACM, 60(9):48–55, Aug. 2017
- [26]
- [27]
- [28]
-
[29]
E. Dahlström, P. Dengler, A. Grasso, C. Lilley, C. McCormack, D. Schepers, J. Watt, J. Ferraiolo, F. Jun, and D. Jackson. Scalable vector graphics (SVG) 1.1. https://www.w3.org/TR/SVG/, August 2011
- [30]
-
[31]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Symposium on Operating System Design and Implementation (OSDI), San Francisco, CA, December 2004
-
[32]
Ç. Demiralp, P. J. Haas, S. Parthasarathy, and T. Pedapati. Foresight: Recommending visual insights. PVLDB, 10(12):1937–1940, 2017
-
[33]
B. Ding, S. Huang, S. Chaudhuri, K. Chakrabarti, and C. Wang. Sample + seek: Approximating aggregates with distribution precision guarantee. In ACM SIGMOD International conference on Management of data, pages 679–694, 2016
- [34]
-
[35]
M. El-Hindi, Z. Zhao, C. Binnig, and T. Kraska. VisTrees: fast indexes for interactive data exploration. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA), page 5, 2016
- [36]
-
[37]
N. Elmqvist and J. Fekete. Hierarchical aggregation for information visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics, 16(3):439–454, May 2010
-
[38]
fastutil: Fast and compact type-specific collections for Java. http://fastutil.di.unimi.it. Retrieved October 2017
-
[39]
J.-D. Fekete and R. Primet. Progressive analytics: A computation paradigm for exploratory data analysis. https://arxiv.org/abs/1607.05162, 2016
-
[40]
J. Feldman, S. Muthukrishnan, A. Sidiropoulos, C. Stein, and Z. Svitkina. On distributing symmetric streaming computations. ACM Trans. Algorithms, 6(4):66:1–66:19, 2010
-
[41]
I. Fette and A. Melnikov. The WebSocket protocol. IETF RFC 6455, December 2011
-
[42]
D. Fisher. Big data exploration requires collaboration between visualization and data infrastructures. In Human-In-the-Loop Data Analytics (HILDA), pages 16:1–16:5, 2016
- [43]
-
[44]
P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Conference on Analysis of Algorithms (AofA) DMTCS proc., pages 127–146, 2007
- [45]
-
[46]
P. Godfrey, J. Gryz, and P. Lasek. Interactive visualization of large data sets. IEEE Transactions on Knowledge and Data Engineering, 28(8):2142–2157, 2016
-
[47]
P. Godfrey, J. Gryz, P. Lasek, and N. Razavi. Visualization through inductive aggregation. In International Conference on Extending Database Technology (EDBT), pages 600–603, 2016
-
[48]
gRPC: A high performance, open-source universal RPC framework. https://grpc.io/. Retrieved October 2017
-
[49]
A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser. Processing a trillion cells per mouse click. PVLDB, 5(11):1436–1446, July 2012
-
[50]
M. Hausenblas and J. Nadeau. Apache Drill: Interactive ad-hoc analysis at scale. IEEE Comput. Graph. Appl., 1(2), June 2013
-
[51]
J. M. Hellerstein, P. J. Haas, and H. J. Wang. Online aggregation. In ACM SIGMOD International conference on Management of data, pages 171–182, 1997
-
[52]
J. F. Hughes, A. van Dam, M. McGuire, D. F. Sklar, J. D. Foley, S. K. Feiner, and K. Akeley. Computer Graphics: Principles and Practice (3rd Edition). Addison-Wesley Professional, 2013
-
[53]
J.-F. Im, K. Gopalakrishna, S. Subramaniam, M. Shrivastava, A. Tumbde, X. Jiang, J. Dai, S. Lee, N. Pawar, J. Li, and R. Aringunram. Pinot: Realtime OLAP for 530 million users. In International Conference on Management of Data (SIGMOD), pages 583–594, 2018
-
[54]
J.-F. Im, F. G. Villegas, and M. J. McGuffin. VisReduce: Fast and responsive incremental information visualization of large datasets. In IEEE International Conference on Big Data, pages 25–32, Oct 2013
-
[55]
J. Jo, W. Kim, S. Yoo, B. Kim, and J. Seo. SwiftTuna: Incrementally exploring large-scale multidimensional data. In IEEE VIS, Phoenix, AZ, October 2016
-
[56]
J. Jo, W. Kim, S. Yoo, B. Kim, and J. Seo. SwiftTuna: Responsive and incremental visual exploration of large-scale multidimensional data. In Pacific Visualization Symposium (PacificVis), pages 131–140, Seoul, Korea, 2017
- [57]
- [58]
- [59]
-
[60]
N. Kamat and A. Nandi. A session-based approach to fast-but-approximate interactive data cube exploration. ACM Trans. Knowl. Discov. Data, 12(1):1–26, Feb. 2018
- [61]
-
[62]
A. Kim, E. Blais, A. Parameswaran, P. Indyk, S. Madden, and R. Rubinfeld. Rapid sampling for visualizations with ordering guarantees. PVLDB, 8(5):521–532, Jan. 2015
-
[63]
A. Kim, L. Xu, T. Siddiqui, S. Huang, S. Madden, and A. Parameswaran. Optimally leveraging density and locality for exploratory browsing and sampling. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics (HILDA 18), HILDA, pages 7:1–7:7, 2018
-
[64]
M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A modern, open-source SQL engine for Hadoop. In Conference on Innovative Data Sys...
- [65]
-
[66]
L. Lins, J. T. Klosowski, and C. Scheidegger. Nanocubes for real-time exploration of spatiotemporal datasets. IEEE Transactions on Visualization and Computer Graphics, 19(12):2456–2465, 2013
-
[67]
Z. Liu, B. Jiang, and J. Heer. imMens: Real-time visual querying of big data. Computer Graphics Forum (Proc. EuroVis), 32, 2013
-
[68]
E. Meijer. Your mouse is a database. ACM Queue, 10(3):20–33, Mar. 2012
- [69]
-
[70]
Microsoft Corp. Tempe. http://research.microsoft.com/en-us/projects/tempe/. Retrieved January 2019
-
[71]
Microsoft PowerBI. https://powerbi.microsoft.com. Accessed October 2017
-
[72]
J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2:143–152, 1982
- [73]
-
[74]
S. Muthukrishnan. Data Streams: Algorithms and Applications. Foundations and trends in theoretical computer science. Now Publishers, 2005
-
[75]
U. D. of Transportation. Airline on-time performance data. https://transtats.bts.gov/Tables.asp?DB ID=120. Retrieved January 2019
-
[76]
OmniSci is the extreme analytics platform. https://www.omnisci.com, Retrieved October 2018
-
[77]
Oracle Corp. Project Nashorn. http://openjdk.java.net/projects/nashorn/. Retrieved February 2018
-
[78]
C. A. L. Pahins, S. A. Stephens, C. Scheidegger, and J. L. D. Comba. Hashedcubes: Simple, low memory, real-time visual exploration of big data. IEEE Transactions on Visualization and Computer Graphics, 23(1):671–680, 2017
-
[79]
N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large MapReduce jobs. In PVLDB, Seattle, WA, August 2011
-
[80]
Y. Park, M. Cafarella, and B. Mozafari. Visualization-aware sampling for very large databases. In International Conference on Data Engineering (ICDE), pages 755–766. IEEE, 2016