Compressed-Resident Genomics: Full-Pipeline Device-Resident GPU LZ77 Decode with Position-Invariant Random Access
Pith reviewed 2026-06-26 19:20 UTC · model grok-4.3
The pith
A full device-resident GPU LZ77 pipeline decodes FASTQ data at up to 260 GB/s while supporting random access to an individual read in 0.362 ms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Extending ACEAPEX into a full device-resident GPU decode pipeline keeps both entropy decoding and match resolution on the device and reaches up to 260 GB/s on FASTQ. A compact coordinate index supports position-invariant random access that decodes an arbitrary read in 0.362 ms, and a range-decode strategy decouples output size from VRAM to sustain 165.7 GB/s on a 50 GB genome. All results are reported as bit-perfect.
What carries the argument
The device-resident GPU decode pipeline that performs entropy decoding and match resolution without host intervention, built on the absolute-offset parallel LZ77 codec ACEAPEX.
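To make the "absolute-offset" idea concrete, here is a minimal sketch of decoding a token stream in which every match names an absolute source position in the output rather than a distance back from the current cursor. The token layout (`"lit"` and `"match"` tuples) and the function `decode_absolute` are hypothetical names chosen for illustration; they are not ACEAPEX's actual format or API, and the loop runs sequentially on the CPU rather than in parallel on a GPU.

```python
from typing import List, Tuple, Union

# Hypothetical token format, for illustration only (not ACEAPEX's layout):
#   ("lit", dst, data)     -> write literal bytes at out[dst : dst + len(data)]
#   ("match", dst, src, n) -> copy n bytes from out[src : src + n] to out[dst : dst + n]
Token = Union[Tuple[str, int, bytes], Tuple[str, int, int, int]]


def decode_absolute(tokens: List[Token], out_len: int) -> bytes:
    """Decode tokens whose source and destination are absolute output positions."""
    out = bytearray(out_len)
    for tok in tokens:
        if tok[0] == "lit":
            _, dst, data = tok
            out[dst:dst + len(data)] = data
        else:
            _, dst, src, n = tok
            # Copy byte by byte so an overlapping match (src + n > dst)
            # repeats freshly written bytes, as in classic LZ77.
            for i in range(n):
                out[dst + i] = out[src + i]
    return bytes(out)


tokens = [
    ("lit", 0, b"ACGT"),
    ("match", 4, 0, 4),   # repeat the first four bases
    ("lit", 8, b"N"),
    ("match", 9, 2, 3),   # copy out[2:5]
]
decoded = decode_absolute(tokens, 12)
print(decoded)
```

Because each token carries its own destination, a token does not need to know how many bytes earlier tokens produced. That is the property a parallel decoder can exploit, provided the bytes a match reads from have already been written.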
If this is right
- Full on-device processing removes host-device transfer overhead during genomic decompression.
- Position-invariant random access allows direct extraction of individual reads without decompressing preceding data.
- Range decoding enables processing of genomes whose decoded output exceeds available VRAM, sustaining 165.7 GB/s on a 50 GB genome.
- The read-to-block index is 6.3x smaller than a .fai file, which reduces index storage overhead.
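The random-access consequence above can be sketched with a read-to-block index: a sorted list holding the first read number of each independently decodable block. This is a minimal sketch under stated assumptions. `build_archive` and `fetch_read` are hypothetical names, `zlib` stands in for the actual codec, and the paper's index layout is not described on this page.

```python
import bisect
import zlib
from typing import List, Tuple


def build_archive(reads: List[bytes], reads_per_block: int) -> Tuple[List[bytes], List[int]]:
    """Compress reads in independent blocks; record the first read id of each block."""
    blocks: List[bytes] = []
    first_read: List[int] = []
    for start in range(0, len(reads), reads_per_block):
        chunk = reads[start:start + reads_per_block]
        blocks.append(zlib.compress(b"\n".join(chunk)))  # zlib is a stand-in codec
        first_read.append(start)
    return blocks, first_read


def fetch_read(blocks: List[bytes], first_read: List[int], read_id: int) -> bytes:
    """Decode only the block that contains read_id, then slice out that read."""
    b = bisect.bisect_right(first_read, read_id) - 1
    records = zlib.decompress(blocks[b]).split(b"\n")
    return records[read_id - first_read[b]]


reads = [f"@r{i}|ACGT{i}".encode() for i in range(10)]
blocks, first_read = build_archive(reads, reads_per_block=4)
picked = fetch_read(blocks, first_read, 6)
print(first_read, picked)
```

The index stores one integer per block rather than one entry per read, which is one plausible reason such an index can be small; whether the paper's index works this way is an assumption.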
Where Pith is reading between the lines
- The same on-device pipeline structure could apply to other LZ77-based formats that currently force full sequential decompression.
- Combining the pipeline with the faster open DietGPU entropy stage would create an entirely open high-throughput stack for compressed genomics.
- Random-access performance at sub-millisecond latency per read could support interactive queries over petabyte-scale archives without first materializing decompressed copies.
Load-bearing premise
The ACEAPEX LZ77 codec can be extended to a complete on-device pipeline while preserving bit-perfect output and the claimed speeds without any hidden host-device transfers or post-processing steps.
What would settle it
Measure the pipeline on a different GPU while confirming no CPU involvement occurs between entropy and match stages and that the reported 260 GB/s throughput and 0.362 ms random-read latency are reproduced.
Original abstract
Genomic archives grow faster than decompression keeps up: the European Nucleotide Archive holds tens of petabytes of fastq.gz, and gzip is fundamentally sequential. GPU decompressors (nvCOMP DEFLATE at ~50GB/s on A100) decode whole files with no random access; CPU genomic tools (CRAM, samtools) support region seeks but only at CPU speed. We extend ACEAPEX, an absolute-offset parallel LZ77 codec included in the official lzbench 2.3 release, with three contributions absent from our prior work. First, a full device-resident GPU decode pipeline (entropy and match resolution both on-device) reaching up to 260GB/s on FASTQ, closing the match-phase-only gap of the earlier paper. Second, position-invariant random access with a compact coordinate index: an arbitrary read decodes in 0.362ms, ~6x faster than warm samtools faidx, with a read-to-block index 6.3x smaller than a .fai. Third, a range-decode strategy that decouples output size from VRAM, sustaining 165.7GB/s on a 50GB genome where whole-file decode runs out of memory. All results are bit-perfect. We also measure Meta's open DietGPU ANS on H100 at 592GB/s decode, faster than the proprietary entropy stage we currently use, showing a fully open high-throughput stack is viable. Code is MIT-licensed.
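The range-decode idea in the abstract (output size decoupled from memory) can be illustrated with a small sketch that decodes a few blocks at a time, hands each decoded range to a consumer, and then releases it. The names `decode_ranges` and `blocks_per_range` are hypothetical, `zlib` again stands in for the codec, and the sketch assumes every range is self-contained; how the paper resolves matches that reach outside a range is not stated on this page.

```python
import hashlib
import zlib
from typing import Iterator, List


def decode_ranges(blocks: List[bytes], blocks_per_range: int) -> Iterator[bytes]:
    """Yield decoded output one range at a time instead of materializing the whole file."""
    for start in range(0, len(blocks), blocks_per_range):
        group = blocks[start:start + blocks_per_range]
        yield b"".join(zlib.decompress(b) for b in group)  # stand-in codec


original = b"ACGTTGCA" * 4096          # 32 KiB of toy sequence data
block_size = 4096
blocks = [zlib.compress(original[i:i + block_size])
          for i in range(0, len(original), block_size)]

digest = hashlib.sha256()
peak = 0
for piece in decode_ranges(blocks, blocks_per_range=2):
    peak = max(peak, len(piece))       # largest buffer held at any one time
    digest.update(piece)               # the consumer processes and drops the range

print(peak, digest.hexdigest() == hashlib.sha256(original).hexdigest())
```

The peak buffer size depends only on the range size, not on the total output size, which is the sense in which a range strategy can handle outputs larger than device memory.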
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends ACEAPEX, an absolute-offset parallel LZ77 codec, with a full device-resident GPU decode pipeline (entropy decoding and match resolution both on-device) for FASTQ data. It reports up to 260 GB/s throughput, position-invariant random access that decodes an arbitrary read in 0.362 ms with a read-to-block index 6.3x smaller than a .fai, and a range-decode strategy sustaining 165.7 GB/s on a 50 GB genome without exceeding VRAM. All results are claimed to be bit-perfect; the work also benchmarks Meta's DietGPU ANS at 592 GB/s on H100 and releases code under the MIT license.
Significance. If the device-resident claims and throughput numbers hold, the work would advance GPU-accelerated genomic decompression by closing the match-phase-only gap from prior ACEAPEX work, enabling random access and large-file handling without host transfers. The open MIT-licensed code is a clear strength supporting reproducibility and further development of open high-throughput stacks.
major comments (2)
- [Abstract and pipeline description] The central claim of a 'full device-resident GPU decode pipeline' with both entropy and match resolution on-device is load-bearing for the 260 GB/s and 165.7 GB/s figures, yet no kernel-launch sequence, memory residency proof, or timing breakdown separating the two stages is supplied to confirm absence of cudaMemcpy or host post-processing.
- [Results and evaluation sections] Concrete throughput, latency (0.362 ms), and index-size ratios are reported without measurement methodology, error bars, full hardware details, or bit-perfect verification procedure, preventing assessment of whether post-hoc tuning or selective reporting affects the numbers.
minor comments (1)
- [Abstract] The proprietary entropy stage used for the main results is not named, while DietGPU is presented as an open alternative; adding this detail would clarify the comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that the device-resident pipeline claim and the reported performance numbers require stronger supporting details for full substantiation and reproducibility. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.
Point-by-point responses
- Referee: [Abstract and pipeline description] The central claim of a 'full device-resident GPU decode pipeline' with both entropy and match resolution on-device is load-bearing for the 260 GB/s and 165.7 GB/s figures, yet no kernel-launch sequence, memory residency proof, or timing breakdown separating the two stages is supplied to confirm absence of cudaMemcpy or host post-processing.
  Authors: We agree that explicit documentation is needed to substantiate the full device-resident claim. While Section 3 of the manuscript describes the pipeline architecture, we will add a new subsection with the exact sequence of CUDA kernel launches (entropy decode followed by match resolution), VRAM residency proofs via allocation details, and a timing breakdown table separating the two stages. This will explicitly confirm the absence of cudaMemcpy or host post-processing during decode.
  Revision: yes
- Referee: [Results and evaluation sections] Concrete throughput, latency (0.362 ms), and index-size ratios are reported without measurement methodology, error bars, full hardware details, or bit-perfect verification procedure, preventing assessment of whether post-hoc tuning or selective reporting affects the numbers.
  Authors: We acknowledge that the current presentation lacks sufficient methodological transparency. We will expand the Results and Evaluation sections to include: full hardware specifications and software environment, the measurement protocol (including run counts, warm-up procedures, and timing methods), error bars or standard deviations for all reported figures, and a detailed description of the bit-perfect verification procedure (byte-for-byte comparison against a reference CPU decoder). These additions will allow independent verification of the results.
  Revision: yes
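The measurement protocol the authors promise (warm-up, repeated runs, standard deviation, byte-for-byte verification) could look roughly like the following harness. This is a hedged sketch, not the authors' procedure: `benchmark` and its parameters are hypothetical, `zlib` is a CPU stand-in for the decoder under test, and timing a real GPU pipeline would additionally require synchronizing the device before reading the clock, because kernel launches are asynchronous.

```python
import statistics
import time
import zlib
from typing import Callable, Dict


def benchmark(decode: Callable[[], bytes], reference: bytes,
              warmup: int = 3, runs: int = 10) -> Dict[str, float]:
    """Time repeated decodes and verify each output byte-for-byte against a reference."""
    for _ in range(warmup):
        decode()                        # warm caches and allocators; results are discarded
    rates = []
    for _ in range(runs):
        t0 = time.perf_counter()
        out = decode()
        dt = time.perf_counter() - t0
        if out != reference:            # bit-perfect check on every timed run
            raise ValueError("decoded output differs from reference")
        rates.append(len(out) / dt / 1e9)   # GB/s of decoded output
    return {
        "runs": runs,
        "mean_gb_per_s": statistics.mean(rates),
        "stdev_gb_per_s": statistics.stdev(rates),
    }


reference = b"ACGTTGCA" * (1 << 17)     # 1 MiB of toy data
compressed = zlib.compress(reference)
result = benchmark(lambda: zlib.decompress(compressed), reference)
print(result)
```

Reporting the mean together with the standard deviation over a stated run count is what lets a reader judge whether a headline figure is typical or a best case.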
Circularity Check
No circularity; performance claims are benchmark-driven with no self-referential derivations
Full rationale
The manuscript reports empirical throughput, latency, and memory figures from GPU kernel runs on FASTQ and genome data. No equations, ansatzes, fitted parameters, or uniqueness theorems appear. References to prior ACEAPEX work describe the baseline being extended rather than supplying load-bearing premises that the new results reduce to by construction. The device-resident pipeline, random-access index, and range-decode strategy are presented as implementation contributions whose validity is asserted via bit-perfect output and measured speeds, not via any definitional or self-citation reduction. This is the expected non-finding for a systems-performance paper whose central claims are externally falsifiable benchmark numbers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: LZ77 decompression (entropy decoding and match resolution) can be performed entirely on-device while remaining bit-perfect.
Forward citations
Cited by 1 Pith paper
- Unified Position-Invariant Random Access Through Two Compression Layers via Absolute-Offset Coordinates: A Bit-Perfect Device-Resident Proof
  Absolute-offset design enables unified position-invariant random access through entropy and match compression layers with one coordinate and bit-perfect verification.
Reference graph
Works this paper leans on
- [1] Y. Shavidze, “ACEAPEX: Parallel LZ77 Decoding via Encode-Time Absolute Offset Resolution,” arXiv:2606.04268, 2026.
- [2] E. Sitaridi et al., “Massively-parallel lossless data decompression,” ICPP, 2016, pp. 242–247.
- [3] M. Köhler, T. Bingmann, and P. Sanders, “Rapidgzip: Parallel decompression and seeking in gzip streams using cache prefetching,” HPDC, 2023, pp. 295–307.
- [4] T. Lin et al., “Recoil: Parallel rANS decoding with decoder-adaptive scalability,” ICS, 2023.
- [5] “SAGe: Storage-Aware Genomic data compression,” arXiv:2504.03732, 2025.
- [6] Meta, “DietGPU: GPU-based lossless compression,” open-source, 2022.
- [7] P. Skibiński, “lzbench: in-memory benchmark of open-source LZ77/LZSS/LZMA compressors,” Version 2.3, 2026. [Online]. Available: https://github.com/inikep/lzbench/releases/tag/v2.3