
arxiv: 2604.06221 · v1 · submitted 2026-03-26 · 📡 eess.SP · cond-mat.mtrl-sci

Inference-Sufficient Representations for High-Throughput Measurement: Lessons from Lossless Compression Benchmarks in 4D-STEM

Pith reviewed 2026-05-15 00:27 UTC · model grok-4.3

classification 📡 eess.SP cond-mat.mtrl-sci
keywords 4D-STEM · lossless compression · Blosc · HDF5 · sparsity · data reduction · electron microscopy · high-throughput imaging

The pith

4D-STEM datasets can be losslessly compressed by more than ten times using standard algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks thirteen lossless compression implementations across five representative 4D-STEM datasets that range in size from 8 MiB to 8 GiB and in sparsity from 49.5 to 92.8 percent. It shows that Blosc-family methods, with blosc_zstd as the representative case, reach compression ratios comparable to gzip-9 (the slowest gzip setting) while compressing 19 to 69 times faster and reading 1.9 to 2.6 times faster. The observed ratios follow a tight power-law relationship with sparsity (R² = 0.99), from about 5× for moderately sparse data to 35× for highly sparse data. The authors conclude that routine use of these compressors can shrink data volumes enough to ease storage and transfer pressures, yet preserving every measured intensity will not by itself keep pace with rising detector rates.

Core claim

Lossless compression of 4D-STEM data routinely exceeds a factor of 10. blosc_zstd achieves a compression ratio comparable to gzip-9 (mean 13.5× versus 12.3×) while compressing 19 to 69 times faster and reading 1.9 to 2.6 times faster, and the ratios follow a power law in sparsity (R² = 0.99).

What carries the argument

Systematic timing and ratio benchmarks of thirteen HDF5-compatible lossless compressors on five datasets, which show that compression ratios are deterministic across runs and closely follow a power law in measured sparsity (R² = 0.99).
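The link between sparsity and compression ratio can be illustrated with a minimal sketch. It assumes that sparsity means the fraction of zero-valued pixels, uses synthetic 16-bit frames rather than real 4D-STEM data, and uses Python's `zlib` at level 9 as a stand-in for HDF5's built-in gzip-9 filter (both are DEFLATE). The function names are hypothetical, and the numbers it prints are not the paper's.

```python
import random
import struct
import zlib


def make_sparse_frame(n_pixels, sparsity, seed=0):
    """Return n_pixels synthetic 16-bit counts; roughly `sparsity` of them are zero."""
    rng = random.Random(seed)
    values = [0 if rng.random() < sparsity else rng.randint(1, 4)
              for _ in range(n_pixels)]
    return struct.pack(f"<{n_pixels}H", *values)


def measured_sparsity(raw):
    """Fraction of zero-valued 16-bit pixels in a packed little-endian buffer."""
    values = struct.unpack(f"<{len(raw) // 2}H", raw)
    return sum(v == 0 for v in values) / len(values)


def compression_ratio(raw, level=9):
    """Uncompressed size divided by DEFLATE-compressed size."""
    packed = zlib.compress(raw, level)
    if zlib.decompress(packed) != raw:  # lossless round-trip check
        raise ValueError("round trip was not lossless")
    return len(raw) / len(packed)


if __name__ == "__main__":
    for target in (0.5, 0.7, 0.9):
        raw = make_sparse_frame(64 * 64, target)
        print(f"sparsity={measured_sparsity(raw):.3f} "
              f"ratio={compression_ratio(raw):.1f}x")
```

On such synthetic frames the ratio rises with sparsity, which is the qualitative trend the benchmarks report; the exact shape of the curve depends on the data and the codec.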

If this is right

  • 4D-STEM data can be routinely compressed by more than 10 times without loss of measurement values.
  • Six top-performing implementations, each suited to a different use case, cover the speed-ratio trade-offs that different workflows need.
  • Compression and decompression times are highly reproducible (coefficient of variation less than 2 percent).
  • Sparsity alone predicts compression performance across dataset sizes.
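The reproducibility figure in the list above is a coefficient of variation (CV) over repeated timing runs. A minimal sketch of how such a number could be computed follows. It assumes the sample standard deviation (the summary does not say which estimator the paper uses) and times `zlib` on a placeholder buffer; `time_compress` and `coefficient_of_variation` are hypothetical helper names.

```python
import statistics
import time
import zlib


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (dimensionless)."""
    return statistics.stdev(samples) / statistics.mean(samples)


def time_compress(raw, level, runs=10):
    """Wall-clock seconds for `runs` independent compressions of one buffer."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        zlib.compress(raw, level)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    raw = bytes(1_000_000)  # all-zero placeholder buffer, not detector data
    timings = time_compress(raw, level=9)
    print(f"CV = {100 * coefficient_of_variation(timings):.2f}%")
```

The CV measured this way depends on machine load and buffer size, so a value from this sketch says nothing about the paper's reported figure.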

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interactive visualization and remote transfer of 4D-STEM data become practical once these compressors are adopted as defaults.
  • Detector-rate growth will force experimenters to replace fully dense raw storage with formats that retain only the intensities needed for a given scientific inference.
  • The power-law dependence suggests that any future 4D-STEM acquisition strategy that increases sparsity will automatically improve compressibility.

Load-bearing premise

The five selected datasets capture the range of sparsity and file sizes that will appear in future high-throughput 4D-STEM experiments.

What would settle it

Collect a new 4D-STEM dataset whose sparsity lies outside the 49.5-92.8 percent interval or whose size exceeds 8 GiB, apply the same compressors, and test whether the measured ratios deviate from the fitted power-law prediction.
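The test described above amounts to fitting a power law and checking how far a new measurement falls from the prediction. The sketch below shows one way to do that with an ordinary least-squares fit in log-log space. The functional form y = a·x^b, the choice of x (for example sparsity), and the helper names are assumptions; the example points are generated placeholders, not the paper's measurements.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**b in log-log space; returns (a, b)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((x - mean_x) ** 2 for x in lx)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lx, ly))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def relative_deviation(a, b, x_new, y_measured):
    """Signed fractional gap between a new measurement and the prediction."""
    predicted = a * x_new ** b
    return (y_measured - predicted) / predicted


if __name__ == "__main__":
    # Placeholder points generated from y = 2 * x**3; NOT the paper's data.
    xs = [0.5, 0.6, 0.7, 0.8, 0.9]
    ys = [2 * x ** 3 for x in xs]
    a, b = fit_power_law(xs, ys)
    print(a, b, relative_deviation(a, b, 0.95, 1.5))
```

With only five fitted points, what counts as a meaningful deviation is a judgment call that this sketch does not settle.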

Figures

Figures reproduced from arXiv: 2604.06221 by Albina Borisevich, Andrew R. Lupini, Miaofang Chi, Ondrej Dyck, Rama K. Vasudevan, Stephen Jesse.

Figure 1: Cross-dataset performance comparison of top 10 compression implementations. (A) Compression ...
Figure 2: Multi-dimensional performance comparison of representative compression implementations. The ...
Figure 3: Compression ratio as a function of data sparsity across five datasets. Each point represents the ...
Figure 4: Effect of chunking strategy on compression performance for representative implementations. (A) ...
Original abstract

Four-dimensional scanning transmission electron microscopy (4D-STEM) generates multi-gigabyte datasets, creating a growing mismatch between acquisition rates and practical storage, transfer, and interactive visualization capabilities. We systematically benchmark 13 lossless compression implementations across 5 representative datasets (8 MiB to 8 GiB, 49.5–92.8% sparsity), with 10 independent runs per method. HDF5 provides built-in gzip compression, of which gzip-9 typically achieves the highest compression ratio but is slow. We therefore evaluate widely available alternatives (via hdf5plugin), including the Blosc family. As a representative comparison, blosc_zstd achieves compression comparable to gzip-9 (mean 13.5× vs 12.3×) while compressing 19–69× faster and reading 1.9–2.6× faster across datasets. Compression ratios are deterministic, and timing measurements are highly reproducible (CV <2%). Compression performance follows a power law with sparsity (R² = 0.99), ranging from 5× for moderately sparse data to 35× for highly sparse data. We identify six top-performing implementations optimized for different use cases and demonstrate that 4D-STEM data can be routinely compressed by >10×. While these results provide practical guidance for lossless compression selection, the broader conclusion is that lossless compression preserves measurements but does not by itself guarantee sustainable high-throughput workflows. As detector rates rise, data handling will increasingly require inference-driven representations, i.e., deciding what must be preserved to support a scientific inference, rather than defaulting to storing fully dense raw measurements.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper benchmarks 13 lossless compression implementations on 5 representative 4D-STEM datasets (8 MiB to 8 GiB, 49.5-92.8% sparsity) using 10 independent runs per method. It reports that blosc_zstd achieves mean compression ratios comparable to gzip-9 (13.5× vs 12.3×) while being 19-69× faster to compress and 1.9-2.6× faster to read, with all ratios deterministic and timings highly reproducible (CV <2%). Compression ratio follows an observed power-law relation with sparsity (R²=0.99), ranging from 5× to 35×, and the work concludes that 4D-STEM data can be routinely compressed by >10× while advocating that lossless compression alone is insufficient for sustainable high-throughput workflows and that inference-sufficient representations will be needed.

Significance. If the empirical benchmarks hold, the work supplies reproducible, practical guidance for compression codec selection in high-throughput 4D-STEM, with the power-law fit offering a simple predictive relation based on sparsity. The low-variability timing data and explicit comparison of speed versus ratio strengthen the case for adopting faster alternatives to gzip-9. The broader point that raw lossless storage will not scale indefinitely is well-taken and points toward needed future work on inference-driven data reduction.
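The speed-versus-ratio comparison can be reproduced in miniature with the standard library. This sketch compares `zlib` effort levels on a synthetic sparse byte buffer. It does not exercise Blosc or hdf5plugin, so it only illustrates the shape of such a benchmark (ratio, write time, read time per setting), and `synthetic_sparse_bytes` and `benchmark` are hypothetical names.

```python
import random
import time
import zlib


def synthetic_sparse_bytes(n_bytes, sparsity, seed=0):
    """Random buffer in which roughly `sparsity` of the bytes are zero."""
    rng = random.Random(seed)
    return bytes(0 if rng.random() < sparsity else rng.randint(1, 255)
                 for _ in range(n_bytes))


def benchmark(raw, level):
    """Return (ratio, compress_seconds, decompress_seconds) for one zlib level."""
    t0 = time.perf_counter()
    packed = zlib.compress(raw, level)
    t1 = time.perf_counter()
    restored = zlib.decompress(packed)
    t2 = time.perf_counter()
    if restored != raw:
        raise ValueError("round trip was not lossless")
    return len(raw) / len(packed), t1 - t0, t2 - t1


if __name__ == "__main__":
    raw = synthetic_sparse_bytes(500_000, 0.9)
    for level in (1, 6, 9):
        ratio, t_write, t_read = benchmark(raw, level)
        print(f"level={level} ratio={ratio:.2f}x "
              f"write={t_write * 1e3:.1f} ms read={t_read * 1e3:.1f} ms")
```

A single timed run per setting is noisy; the paper's protocol of 10 independent runs per method is the more defensible design.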

major comments (1)
  1. [Abstract] The claim that the benchmarks demonstrate 4D-STEM data 'can be routinely compressed by >10×' is not supported by the reported measurements. The power-law fit yields only 5× at 49.5% sparsity, the lowest value in the tested range, yet the abstract presents >10× as routine without additional argument that future high-throughput experiments will avoid this moderate-sparsity regime.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful review and constructive comment on the abstract. We agree that the claim of routine >10× compression requires qualification in light of the observed sparsity dependence, and we will revise the manuscript accordingly to strengthen the presentation.

read point-by-point responses
  1. Referee: [Abstract] The claim that the benchmarks demonstrate 4D-STEM data 'can be routinely compressed by >10×' is not supported by the reported measurements. The power-law fit yields only 5× at 49.5% sparsity, the lowest value in the tested range, yet the abstract presents >10× as routine without additional argument that future high-throughput experiments will avoid this moderate-sparsity regime.

    Authors: We thank the referee for identifying this point. The power-law fit (R² = 0.99) indeed predicts approximately 5× compression at 49.5% sparsity and reaches 10× near 60% sparsity. Our five datasets span 49.5–92.8% sparsity, with four of them yielding >10×, but the abstract does not explicitly address why moderate-sparsity cases may be less representative of future high-throughput 4D-STEM experiments. We will revise the abstract to state that ratios exceed 10× for sparsities above ~60% (per the fitted relation) and add a short clause noting that typical 4D-STEM datasets in the literature often operate above this threshold, while retaining the full range of measured results for transparency.

    Revision: yes
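The "10× near 60% sparsity" figure in the response can be sanity-checked with a two-point calculation. The sketch assumes the power law has the form ratio = a·s^b with s the sparsity in percent, and that the curve passes through the endpoints of the reported range (5× at 49.5% and 35× at 92.8%). Neither assumption is confirmed by the summary, and the actual fitted parameters are not given here.

```python
import math

# Endpoints taken from the reported range; placing both exactly on the
# fitted curve is an assumption made for illustration only.
S_LOW, R_LOW = 49.5, 5.0      # percent sparsity, compression ratio
S_HIGH, R_HIGH = 92.8, 35.0


def two_point_power_law(x1, y1, x2, y2):
    """Return (a, b) such that y = a * x**b passes through both points."""
    b = math.log(y2 / y1) / math.log(x2 / x1)
    a = y1 / x1 ** b
    return a, b


def sparsity_for_ratio(a, b, ratio):
    """Invert y = a * x**b to find the x that gives the requested ratio."""
    return (ratio / a) ** (1.0 / b)


a, b = two_point_power_law(S_LOW, R_LOW, S_HIGH, R_HIGH)
crossing = sparsity_for_ratio(a, b, 10.0)
print(f"exponent b = {b:.2f}, 10x reached near {crossing:.0f}% sparsity")
```

Under these assumptions the exponent comes out at about 3.1 and the 10× crossing at about 62% sparsity, which is in line with the "near 60%" quoted in the response.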

Circularity Check

0 steps flagged

No circularity: all central results are direct empirical measurements and observed fits

full rationale

The paper reports benchmark measurements of compression ratios, speeds, and reproducibility (CV <2%) across five fixed datasets, plus a post-hoc power-law fit to the observed sparsity-ratio pairs (R²=0.99). No derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. There are no self-definitional equations, fitted parameters renamed as predictions, load-bearing self-citations, uniqueness theorems, or smuggled ansatzes. The >10× routine-compression statement is an interpretive summary of the measured range (5×–35×), not a mathematical reduction; any tension with the lowest-sparsity case is a question of representativeness, not circularity. The analysis is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The only fitted element is the power-law relation between sparsity and compression ratio; no new physical constants, particles, or ad-hoc entities are introduced.

free parameters (1)
  • power-law parameters for sparsity-compression relation
    Observed R²=0.99 fit across the five datasets; used to generalize the compression behavior.



