Inference-Sufficient Representations for High-Throughput Measurement: Lessons from Lossless Compression Benchmarks in 4D-STEM
Pith reviewed 2026-05-15 00:27 UTC · model grok-4.3
The pith
4D-STEM datasets can be losslessly compressed by more than ten times using standard algorithms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lossless compression of 4D-STEM data routinely achieves ratios above 10×. blosc_zstd gives a compression ratio comparable to gzip-9 (mean 13.5× versus 12.3×) while compressing 19 to 69 times faster and reading 1.9 to 2.6 times faster. Across the tested datasets, compression ratio follows a power law in sparsity (R² = 0.99), from about 5× at moderate sparsity to 35× at high sparsity.
What carries the argument
Systematic timing and ratio benchmarks of thirteen HDF5-compatible lossless compressors on five datasets, with ten runs per method, which show that compression ratio closely tracks an empirical power law in measured sparsity.
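To make the two measured quantities concrete, here is a minimal sketch of how sparsity and compression ratio might be computed for a single buffer. It is not the paper's pipeline: it uses Python's standard-library zlib rather than the HDF5 filters that were benchmarked, sparsity is assumed to mean the fraction of zero-valued pixels, and `make_frame` is a hypothetical generator of synthetic data.

```python
import random
import zlib


def make_frame(n_pixels, sparsity, seed=0):
    """Return n_pixels bytes where about `sparsity` of them are zero.

    Nonzero pixels get small counts (1-4), a hypothetical stand-in for
    electron-counting data; real detector frames will differ.
    """
    rng = random.Random(seed)
    return bytes(
        0 if rng.random() < sparsity else rng.randint(1, 4)
        for _ in range(n_pixels)
    )


def measured_sparsity(buf):
    """Fraction of zero-valued pixels in the buffer (assumed definition)."""
    return buf.count(0) / len(buf)


def compression_ratio(buf, level=9):
    """Uncompressed size divided by zlib-compressed size."""
    return len(buf) / len(zlib.compress(buf, level))


if __name__ == "__main__":
    for target in (0.50, 0.70, 0.90):
        frame = make_frame(256 * 256, target)
        print(f"sparsity={measured_sparsity(frame):.3f} "
              f"ratio={compression_ratio(frame):.2f}x")
```

On synthetic data of this kind the ratio rises with sparsity, but the absolute values should not be compared with the paper's figures.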
If this is right
- 4D-STEM data can be routinely compressed by more than 10 times without loss of measurement values.
- blosc_zstd and five other top implementations give the best speed-ratio trade-off for different workflow needs.
- Compression and decompression times are highly reproducible (coefficient of variation less than 2 percent).
- Sparsity alone predicts compression performance across dataset sizes.
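The reproducibility figure in the list above is a coefficient of variation (CV), that is, the standard deviation of repeated timings divided by their mean. A minimal sketch follows, assuming the sample standard deviation is used (the source does not say which estimator the authors chose) and timing zlib on a hypothetical all-zero buffer.

```python
import statistics
import time
import zlib


def time_runs(func, n_runs=10):
    """Call func() n_runs times and return wall-clock durations in seconds."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean, as a percentage."""
    return 100.0 * statistics.stdev(samples) / statistics.mean(samples)


if __name__ == "__main__":
    payload = bytes(1_000_000)  # hypothetical all-zero buffer, not real data
    runs = time_runs(lambda: zlib.compress(payload, 9))
    print(f"mean={statistics.mean(runs):.4f}s "
          f"CV={coefficient_of_variation(runs):.1f}%")
```

Short runs on a shared machine will often show a CV well above 2 percent, so a comparison with the reported value is only meaningful for long runs on dedicated hardware.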
Where Pith is reading between the lines
- Interactive visualization and remote transfer of 4D-STEM data become practical once these compressors are adopted as defaults.
- Detector-rate growth will force experimenters to replace raw-density storage with formats that retain only the intensities needed for a given scientific inference.
- The power-law dependence suggests that any future 4D-STEM acquisition strategy that increases sparsity will automatically improve compressibility.
Load-bearing premise
The five selected datasets capture the range of sparsity and file sizes that will appear in future high-throughput 4D-STEM experiments.
What would settle it
Collect a new 4D-STEM dataset whose sparsity lies outside the 49.5-92.8 percent interval or whose size exceeds 8 GiB, apply the same compressors, and test whether the measured ratios deviate from the fitted power-law prediction.
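The test described above can be sketched as a log-log least-squares fit followed by a comparison of a new measurement against the fitted curve. The functional form y = a·x^b with x equal to sparsity is an assumption, since the source does not state the exact parameterization, and every number in the example is synthetic rather than taken from the paper.

```python
import math


def fit_power_law(x, y):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b, r2).

    r2 is computed on the log-transformed values.
    """
    lx = [math.log(v) for v in x]
    ly = [math.log(v) for v in y]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((u - mx) ** 2 for u in lx)
    sxy = sum((u - mx) * (v - my) for u, v in zip(lx, ly))
    b = sxy / sxx
    log_a = my - b * mx
    ss_res = sum((v - (log_a + b * u)) ** 2 for u, v in zip(lx, ly))
    ss_tot = sum((v - my) ** 2 for v in ly)
    return math.exp(log_a), b, 1.0 - ss_res / ss_tot


def relative_deviation(a, b, x_new, y_new):
    """Signed fractional gap between a new measurement and the fitted curve."""
    predicted = a * x_new ** b
    return (y_new - predicted) / predicted


if __name__ == "__main__":
    # Hypothetical (sparsity, ratio) pairs generated from y = 40 * x**3;
    # these are NOT the paper's measurements.
    sparsity = [0.50, 0.60, 0.70, 0.80, 0.90]
    ratio = [40.0 * s ** 3 for s in sparsity]
    a, b, r2 = fit_power_law(sparsity, ratio)
    print(f"a={a:.2f} b={b:.2f} R^2={r2:.3f}")
    # A made-up out-of-range measurement to compare against the fit.
    print(f"deviation={relative_deviation(a, b, 0.95, 30.0):+.1%}")
```

A large deviation for a dataset outside the fitted sparsity range would count against the power-law relation. What size of deviation is large is a judgment the source does not specify.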
Original abstract
Four-dimensional scanning transmission electron microscopy (4D-STEM) generates multi-gigabyte datasets, creating a growing mismatch between acquisition rates and practical storage, transfer, and interactive visualization capabilities. We systematically benchmark 13 lossless compression implementations across 5 representative datasets (8 MiB to 8 GiB, 49.5–92.8% sparsity), with 10 independent runs per method. HDF5 provides built-in gzip compression, of which gzip-9 typically achieves the highest compression ratio but is slow. We therefore evaluate widely available alternatives (via hdf5plugin), including the Blosc family. As a representative comparison, blosc_zstd achieves compression comparable to gzip-9 (mean 13.5× vs 12.3×) while compressing 19–69× faster and reading 1.9–2.6× faster across datasets. Compression ratios are deterministic, and timing measurements are highly reproducible (CV <2%). Compression performance follows a power law with sparsity (R² = 0.99), ranging from 5× for moderately sparse data to 35× for highly sparse data. We identify six top-performing implementations optimized for different use cases and demonstrate that 4D-STEM data can be routinely compressed by >10×. While these results provide practical guidance for lossless compression selection, the broader conclusion is that lossless compression preserves measurements but does not by itself guarantee sustainable high-throughput workflows. As detector rates rise, data handling will increasingly require inference-driven representations, i.e., deciding what must be preserved to support a scientific inference, rather than defaulting to storing fully dense raw measurements.
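The comparison protocol in the abstract (repeated runs per codec, then ratio, speed, and determinism) can be illustrated with a small harness. This is a sketch under stated assumptions, not the authors' benchmark: it compares two zlib levels from the Python standard library as stand-ins for gzip-9 and a faster codec, it does not use HDF5 or hdf5plugin, and the input buffer is synthetic.

```python
import random
import statistics
import time
import zlib


def benchmark(data, level, n_runs=10):
    """Compress `data` n_runs times at one zlib level.

    Returns (compression ratio, median seconds, whether every run
    produced byte-identical output).
    """
    reference = zlib.compress(data, level)
    durations = []
    identical = True
    for _ in range(n_runs):
        start = time.perf_counter()
        out = zlib.compress(data, level)
        durations.append(time.perf_counter() - start)
        identical = identical and out == reference
    ratio = len(data) / len(reference)
    return ratio, statistics.median(durations), identical


if __name__ == "__main__":
    # Hypothetical buffer with roughly 90% zeros; not one of the paper's datasets.
    rng = random.Random(0)
    data = bytes(0 if rng.random() < 0.9 else rng.randint(1, 4)
                 for _ in range(1_000_000))
    slow = benchmark(data, level=9)   # stand-in for gzip-9
    fast = benchmark(data, level=1)   # stand-in for a faster codec
    print(f"level 9: ratio={slow[0]:.1f}x time={slow[1]:.4f}s same={slow[2]}")
    print(f"level 1: ratio={fast[0]:.1f}x time={fast[1]:.4f}s same={fast[2]}")
    print(f"speedup of level 1 over level 9: {slow[1] / fast[1]:.1f}x")
```

The median is used here as a robust summary of the repeated timings. The source reports CV but does not say which summary statistic underlies the quoted speedups.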
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks 13 lossless compression implementations on 5 representative 4D-STEM datasets (8 MiB to 8 GiB, 49.5–92.8% sparsity) using 10 independent runs per method. It reports that blosc_zstd achieves mean compression ratios comparable to gzip-9 (13.5× vs 12.3×) while being 19–69× faster to compress and 1.9–2.6× faster to read, with all ratios deterministic and timings highly reproducible (CV <2%). Compression ratio follows an observed power-law relation with sparsity (R² = 0.99), ranging from 5× to 35×. The work concludes that 4D-STEM data can be routinely compressed by >10×, while arguing that lossless compression alone is insufficient for sustainable high-throughput workflows and that inference-sufficient representations will be needed.
Significance. If the empirical benchmarks hold, the work supplies reproducible, practical guidance for compression codec selection in high-throughput 4D-STEM, with the power-law fit offering a simple predictive relation based on sparsity. The low-variability timing data and explicit comparison of speed versus ratio strengthen the case for adopting faster alternatives to gzip-9. The broader point that raw lossless storage will not scale indefinitely is well-taken and points toward needed future work on inference-driven data reduction.
major comments (1)
- [Abstract] Abstract: The claim that the benchmarks demonstrate 4D-STEM data 'can be routinely compressed by >10×' is not supported by the reported measurements. The power-law fit yields only 5× at 49.5% sparsity, the lowest value in the tested range, yet the abstract presents >10× as routine without additional argument that future high-throughput experiments will avoid this moderate-sparsity regime.
Simulated Author's Rebuttal
We thank the referee for their careful review and constructive comment on the abstract. We agree that the claim of routine >10× compression requires qualification in light of the observed sparsity dependence, and we will revise the manuscript accordingly to strengthen the presentation.
- Referee: [Abstract] Abstract: The claim that the benchmarks demonstrate 4D-STEM data 'can be routinely compressed by >10×' is not supported by the reported measurements. The power-law fit yields only 5× at 49.5% sparsity, the lowest value in the tested range, yet the abstract presents >10× as routine without additional argument that future high-throughput experiments will avoid this moderate-sparsity regime.
  Authors: We thank the referee for identifying this point. The power-law fit (R² = 0.99) indeed predicts approximately 5× compression at 49.5% sparsity and reaches 10× near 60% sparsity. Our five datasets span 49.5–92.8% sparsity, with four of them yielding >10×, but the abstract does not explicitly address why moderate-sparsity cases may be less representative of future high-throughput 4D-STEM experiments. We will revise the abstract to state that ratios exceed 10× for sparsities above ~60% (per the fitted relation) and add a short clause noting that typical 4D-STEM datasets in the literature often operate above this threshold, while retaining the full range of measured results for transparency.
  Revision: yes
Circularity Check
No circularity: all central results are direct empirical measurements and observed fits
The paper reports benchmark measurements of compression ratios, speeds, and reproducibility (CV <2%) across five fixed datasets, plus a post-hoc power-law fit to the observed sparsity-ratio pairs (R²=0.99). No derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. There are no self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes. The >10× routine-compression statement is an interpretive summary of the measured range (5×–35×), not a mathematical reduction; any tension with the lowest-sparsity case is a question of representativeness, not circularity. The analysis is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- power-law parameters for sparsity-compression relation
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem.
  Passage: Compression performance follows a power law with sparsity (R² = 0.99), ranging from 5× for moderately sparse data to 35× for highly sparse data.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.