pith. sign in

arxiv: 2606.03570 · v2 · pith:AX2YPO6Snew · submitted 2026-06-02 · 💻 cs.IT · math.IT

STC: Reversible Digit-Context Decomposition for BWT-Family Text Compression

Pith reviewed 2026-06-28 08:12 UTC · model grok-4.3

classification 💻 cs.IT math.IT
keywords Burrows-Wheeler transformtext compressiondigit decompositionenwik9BWT-family compressorcontext separationreversible transformationside streams
0
0 comments X

The pith

Digit-context decomposition before BWT reduces the enwik9 archive by 2.6 million bytes relative to the unsplit control.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

STC identifies runs of digits in structured text and replaces each run with a single unambiguous placeholder in the main stream. The removed digits are written to separate side streams that are conditioned on run length and surrounding context. These side streams rely on stable bucket ordering and compact packing so the decoder can restore the original sequence without any extra permutation information. Both the normalized main stream and the side streams are then fed to the same fixed BWT-style component coder. An ablation on the full enwik9 corpus shows the split produces a measurably smaller archive than the no-split baseline while still allowing exact reconstruction.

Core claim

STC replaces digit runs in the input with unambiguous placeholders, stores the removed digits in length- and context-conditioned side streams that use stable bucket ordering and compact digit packing, and feeds the resulting streams to a fixed internal BWT/M03-style coder; on enwik9 the resulting archive is 157,388,188 bytes with a 183,174-byte decoder source package.

What carries the argument

Reversible digit-context decomposition that substitutes digit runs with placeholders in the main stream and packs the digits into ordered side streams.

If this is right

  • The digit-context split reduces the main archive by 2,629,561 bytes on full enwik9.
  • The combined archive plus decoder source totals 157,571,362 bytes under the local LTCB-style accounting.
  • No separate permutation data is required because the side-stream design reconstructs run order deterministically.
  • Full decode followed by SHA-256 verification confirms that the original text is recovered exactly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same placeholder-plus-side-stream pattern could be applied to other varying numeric fields such as dates or coordinates without changing the core BWT stage.
  • Because the decomposition is placed before the transform, it is independent of the particular BWT implementation and could be tested with any other BWT-family coder.
  • The method may produce larger gains on corpora that contain more identifier-style numeric runs than natural-language text.

Load-bearing premise

The side streams preserve stable bucket ordering and compact digit packing so that the decoder can recover the exact original run order from the normalized main stream alone.

What would settle it

An input where applying the digit-context split produces a larger total archive size than the no-split control, or where the decoder output fails a byte-for-byte match with the original file.

Figures

Figures reproduced from arXiv: 2606.03570 by Anling Xiang, Jingyang Du, Yang Shen.

Figure 1
Figure 1. Figure 1: Overview of the STC compression and decompression pipeline. Digit runs are removed from the main stream before BWT and restored from context-conditioned side streams during decompression. The archive is then the concatenation of a small mode/length header, a BWT component encoding of m, and BWT component encodings of the side streams. The main component uses a text-heavy fixed profile and the side componen… view at source ↗
Figure 2
Figure 2. Figure 2: Example of reversible digit-context decomposition. Digit runs are replaced by placeholders in the main stream and encoded in side streams selected by run length and local context. 3.4.5 Recomposition and reversibility Proposition 1. Given the main stream and all side streams produced by STC decomposition, recomposition reconstructs the original byte string exactly. Proof. Every digit position in the input … view at source ↗
Figure 3
Figure 3. Figure 3: Full-enwik9 same-coder ablation. Each row changes only the digit-context decomposition while keeping the component coder fixed. Component Raw bytes Compressed bytes Notes m 1,000,000,000 148,105,611 Placeholder-normalized main stream s0 3,507,416 1,028,681 Bucket 1 stream s1 4,761,161 2,727,656 Bucket 2 stream s2 3,126,704 1,473,594 Bucket 3 stream s3 2,648,434 232,291 Bucket 4 first-byte stream s4 2,648,4… view at source ↗
read the original abstract

Burrows-Wheeler-transform-based compressors rely on local context regularity, but structured text also contains dates, counters, identifiers, coordinates, and other digit runs whose values vary differently from their surrounding tokens. STC is a practical BWT-family compressor that separates this source of variation before the component BWT stage. It replaces digit runs in the main stream with an unambiguous placeholder and stores the removed digits in length- and context-conditioned side streams. The side streams use stable bucket ordering and compact digit packing, so the decoder can reconstruct the original run order from the normalized main stream without storing a separate permutation. The resulting components are encoded by a fixed internal BWT/M03-style component coder. On enwik9, STC produces a 157,388,188-byte archive with a 183,174-byte decoder source package, giving a local LTCB-style total of 157,571,362 bytes. A full-enwik9 same-coder ablation shows that the digit-context decomposition reduces the archive by 2,629,561 bytes relative to the no-split control. The result is locally verified by full decode and SHA-256 matching; official benchmark status requires independent maintainer-side verification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper presents STC, a BWT-family text compressor that performs reversible digit-context decomposition: digit runs are replaced by placeholders in the main stream and stored in length- and context-conditioned side streams that use stable bucket ordering and compact packing to enable reconstruction without an explicit permutation. The components are then encoded by a fixed internal BWT/M03-style coder. On enwik9 the method yields a 157,388,188-byte archive plus a 183,174-byte decoder package (LTCB-style total 157,571,362 bytes) and an ablation shows a 2,629,561-byte reduction relative to the identical coder without the split; the pipeline is verified by full decode plus SHA-256 match.

Significance. If the reported sizes and ablation hold, the work supplies a concrete, locally verified technique for isolating digit-run variation that is orthogonal to the core BWT stage, producing a measurable net saving on a standard corpus while keeping the decoder self-contained. The same-coder ablation and end-to-end verification strengthen the empirical claim.

minor comments (4)
  1. [abstract, §3] The abstract and §3 (side-stream design) state that context conditioning and stable bucket ordering suffice for reconstruction, but provide no pseudocode or parameter values for the conditioning function; a short algorithmic description would improve reproducibility.
  2. [§2] No comparison is given against prior published methods for handling numeric runs inside BWT compressors (e.g., digit-aware preprocessing in other LTCB entries); adding a brief related-work paragraph would situate the 2.63 MB gain.
  3. [§4] The ablation reports a single aggregate saving; stating the per-component byte counts (main stream vs. side streams) for both the split and control runs would make the source of the improvement explicit.
  4. [abstract] The decoder source size (183,174 bytes) is included in the LTCB total, but the manuscript does not indicate whether this size was measured with the same compiler flags and library versions used for the archive; a one-sentence clarification would remove ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the detailed summary, positive significance assessment, and recommendation of minor revision. No specific major comments were provided in the report, so there are no individual points requiring point-by-point rebuttal or changes at this stage. We remain available to address any additional concerns the editor or referee may wish to raise.

Circularity Check

0 steps flagged

No significant circularity; empirical ablation result stands on measured sizes

full rationale

The manuscript presents a compression technique whose central claim is an empirical size reduction (2.63 MB on enwik9 via same-coder ablation) together with a decoder reconstruction path that is exercised by full decode + SHA-256 verification. No equations, fitted parameters, or self-citations are shown that would reduce the reported savings or the side-stream reconstruction to a definitional identity. The side-stream design (stable bucket ordering, compact packing) is described as an implementation choice whose correctness is directly tested by the reported verification step, not assumed without evidence. This is a standard empirical engineering result with an explicit control; the derivation chain contains no load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the domain assumption that digit runs form a separable source of variation whose removal improves BWT locality, plus standard properties of BWT and stable sorting; no numeric free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption BWT-family compressors rely on local context regularity that digit runs disrupt
    Stated in the opening sentence of the abstract as the motivation for separation.
  • domain assumption Digit runs can be unambiguously identified and replaced by placeholders while preserving decodability via side streams
    Core premise of the reversible decomposition described in the abstract.

pith-pipeline@v0.9.1-grok · 5737 in / 1453 out tokens · 46597 ms · 2026-06-28T08:12:21.068766+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 2 canonical work pages · 1 internal anchor

  1. [1]

    1994 , url =

    A Block-Sorting Lossless Data Compression Algorithm , author =. 1994 , url =

  2. [2]

    An Analysis of the

    Manzini, Giovanni , journal =. An Analysis of the. 2001 , doi =

  3. [3]

    , journal =

    Fenwick, Peter M. , journal =. The. 1996 , doi =

  4. [4]

    High-Performance

    Trinca, Daniel , journal =. High-Performance. 2005 , url =

  5. [5]

    Efficient Construction of the

    D. Efficient Construction of the. Information and Computation , volume =. 2023 , doi =

  6. [6]

    2020 , doi =

    Niemi, Arto and Teuhola, Jukka , journal =. 2020 , doi =

  7. [7]

    Adjeroh, Donald and Bell, Timothy and Mukherjee, Amar , publisher =. The. 2008 , doi =

  8. [8]

    1990 , url =

    Text Compression , author =. 1990 , url =

  9. [9]

    2007 , doi =

    Data Compression: The Complete Reference , author =. 2007 , doi =

  10. [10]

    2017 , url =

    Introduction to Data Compression , author =. 2017 , url =

  11. [11]

    2026 , url =

    bzip2 and libbzip2 , author =. 2026 , url =

  12. [12]

    2026 , url =

    libbsc: Block Sorting Compressor , author =. 2026 , url =

  13. [13]

    bsc-m03: Experimental Block Sorting Compressor Based on

    Grebnov, Ilya , year =. bsc-m03: Experimental Block Sorting Compressor Based on

  14. [14]

    On Bijective Variants of the

    Kufleitner, Manfred , journal =. On Bijective Variants of the. 2009 , url =

  15. [15]

    A Bijective String Sorting Transform

    A Bijective String Sorting Transform , author =. arXiv preprint arXiv:1201.3077 , year =

  16. [16]

    An Extension of the

    Mantaci, Sabrina and Restivo, Antonio and Rosone, Giovanna and Sciortino, Marinella , journal =. An Extension of the. 2007 , doi =

  17. [17]

    Constructing the Bijective and the Extended

    Bannai, Hideo and Karkkainen, Juha and Koppl, Dominik and Piatkowski, Martin , journal =. Constructing the Bijective and the Extended. 2019 , url =

  18. [18]

    In-Place Bijective

    Koppl, Dominik and Sadakane, Kunihiko , journal =. In-Place Bijective. 2020 , url =

  19. [19]

    Fully Functional Suffix Trees and Optimal Text Searching in

    Gagie, Travis and Navarro, Gonzalo and Prezza, Nicola , journal =. Fully Functional Suffix Trees and Optimal Text Searching in. 2020 , doi =

  20. [20]

    Bijective

    Badkobeh, Golnaz and Bannai, Hideo and Koppl, Dominik , journal =. Bijective. 2024 , url =

  21. [21]

    Bit Catastrophes for the

    Boucher, Christina and Cenzato, Davide and Lipt. Bit Catastrophes for the. Theory of Computing Systems , year =

  22. [22]

    On the Compressiveness of the

    Bannai, Hideo and I, Tomohiro and Nakashima, Yuto , booktitle =. On the Compressiveness of the. 2025 , doi =

  23. [23]

    Proceedings of SEA 2025 , series =

    Adler, Isabel and Cenzato, Davide and Lipt. Proceedings of SEA 2025 , series =. 2025 , doi =

  24. [24]

    Decomposing Words for Enhanced Compression: Exploring the Number of Runs in the Extended

    Ingels, Florian and Denis, Ana. Decomposing Words for Enhanced Compression: Exploring the Number of Runs in the Extended. arXiv preprint arXiv:2506.04926 , year =

  25. [25]

    Theoretical Computer Science , volume =

    A New Class of Searchable and Provably Highly Compressible String Transformations , author =. Theoretical Computer Science , volume =. 2018 , url =

  26. [26]

    and Shapira, Dana , journal =

    Fruchtman, Amir and Gross, Yossi and Klein, Shmuel T. and Shapira, Dana , journal =. Weighted. 2021 , url =

  27. [27]

    Software: Practice and Experience , volume =

    Revisiting Dictionary-Based Compression , author =. Software: Practice and Experience , volume =. 2005 , doi =

  28. [28]

    1999 , url =

    Browsing and Searching Compressed Documents , author =. 1999 , url =

  29. [29]

    2000 , doi =

    Liefke, Hartmut and Suciu, Dan , booktitle =. 2000 , doi =

  30. [30]

    and Haritsa, Jayant R

    Tolani, Parag M. and Haritsa, Jayant R. , booktitle =. 2002 , doi =

  31. [31]

    Path Queries on Compressed

    Buneman, Peter and Grohe, Martin and Koch, Christoph , booktitle =. Path Queries on Compressed. 2003 , url =

  32. [32]

    Proceedings of SIGMOD , pages =

    Integrating Compression and Execution in Column-Oriented Database Systems , author =. Proceedings of SIGMOD , pages =. 2006 , doi =

  33. [33]

    Stonebraker, Michael and Abadi, Daniel J. and Batkin, Adam and Chen, Xuedong and Cherniack, Mitch and Ferreira, Miguel and Lau, Edmond and Lin, Amerson and Madden, Samuel and O'Neil, Elizabeth and O'Neil, Patrick and Rasin, Alex and Tran, Nga and Zdonik, Stanley , booktitle =. 2005 , url =

  34. [34]

    Proceedings of the VLDB Endowment , volume =

    Dremel: Interactive Analysis of Web-Scale Datasets , author =. Proceedings of the VLDB Endowment , volume =. 2010 , doi =

  35. [35]

    Proceedings of ICDE , pages =

    Compressing Relations and Indexes , author =. Proceedings of ICDE , pages =. 1998 , doi =

  36. [36]

    Super-Scalar

    Zukowski, Marcin and Heman, Sandor and Nes, Niels and Boncz, Peter , booktitle =. Super-Scalar. 2006 , doi =

  37. [37]

    Software: Practice and Experience , volume =

    Decoding Billions of Integers per Second through Vectorization , author =. Software: Practice and Experience , volume =. 2015 , doi =

  38. [38]

    Lemire, Daniel and Kurz, Nathan and Rupp, Christoph , journal =. Stream. 2018 , doi =

  39. [39]

    Proceedings of the VLDB Endowment , volume =

    Gorilla: A Fast, Scalable, In-Memory Time Series Database , author =. Proceedings of the VLDB Endowment , volume =. 2015 , doi =

  40. [40]

    Computer , volume =

    A Technique for High-Performance Data Compression , author =. Computer , volume =. 1984 , doi =

  41. [41]

    1999 , url =

    Managing Gigabytes: Compressing and Indexing Documents and Images , author =. 1999 , url =

  42. [42]

    Asymmetric Numeral Systems: Entropy Coding Combining Speed of

    Duda, Jarek , journal =. Asymmetric Numeral Systems: Entropy Coding Combining Speed of. 2013 , url =

  43. [43]

    Generalized

    Rissanen, Jorma , journal =. Generalized. 1976 , doi =

  44. [44]

    Communications of the ACM , volume =

    Arithmetic Coding for Data Compression , author =. Communications of the ACM , volume =. 1987 , doi =

  45. [45]

    Proceedings of the IRE , volume =

    A Method for the Construction of Minimum-Redundancy Codes , author =. Proceedings of the IRE , volume =. 1952 , doi =

  46. [46]

    IEEE Transactions on Information Theory , volume =

    A Universal Algorithm for Sequential Data Compression , author =. IEEE Transactions on Information Theory , volume =. 1977 , doi =

  47. [47]

    IEEE Transactions on Information Theory , volume =

    Compression of Individual Sequences via Variable-Rate Coding , author =. IEEE Transactions on Information Theory , volume =. 1978 , doi =

  48. [48]

    IEEE Transactions on Communications , volume =

    Data Compression Using Adaptive Coding and Partial String Matching , author =. IEEE Transactions on Communications , volume =. 1984 , doi =

  49. [49]

    2026 , url =

    Rules for the Large Text Compression Benchmark , author =. 2026 , url =

  50. [50]

    Hutter, Marcus , year =. The

  51. [51]

    2026 , url =

    Reproducible Builds , author =. 2026 , url =