pith. machine review for the scientific record. sign in

arxiv: 2602.13735 · v2 · submitted 2026-02-14 · 💻 cs.DS

Recognition: no theorem link

Compressed Index with Construction in Compressed Space

Authors on Pith no claims yet

Pith reviewed 2026-05-15 22:23 UTC · model grok-4.3

classification 💻 cs.DS
keywords compressed indexstring complexitystreaming constructiontext indexingpattern searchdata structures
0
0 comments X

The pith

A string index of size O(δ log n/δ) supports fast searches and builds itself in the same space via one streaming pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes an index for a string s of length n whose space is O(δ log n/δ), where δ is the string complexity of s. Searches for any pattern of length m finish in O(m + (occ + 1) log^ε n) time for any fixed ε > 0. The construction runs in expected O(n log n) time during a single left-to-right pass while using only O(δ log n/δ) space at every step. A sympathetic reader would care because prior indexes either needed extra working space during construction or could not guarantee the final compressed bound throughout the process. The result removes a practical barrier to indexing large texts under tight memory limits.

Core claim

The index on s with O(δ log n/δ) space can search in s any string of length m in O(m + (occ + 1) log^ε n) time. The index can be built in O(n log n) expected time by one left-to-right pass on the string s in a streaming fashion with O(δ log n/δ) construction space. This is the first index that can be constructed within such space and with such time guarantees.

What carries the argument

Streaming left-to-right construction of the compressed index that keeps total space at O(δ log n/δ) throughout the single pass, using hash tables or deterministic dictionaries and avoiding Karp-Rabin fingerprints.

If this is right

  • Search times match the currently best known bounds for compressed indexes.
  • Space usage is nearly optimal and equals the known lower bound O(δ log n/(δ α)) whenever δ = O(n / α²).
  • Construction processes the entire string in one left-to-right pass with expected O(n log n) time.
  • The construction can be made fully deterministic by replacing hash tables with deterministic dictionaries, at the cost of a slowdown.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same streaming technique could reduce peak memory for other string data structures whose construction currently requires uncompressed working space.
  • Indexing very long strings on memory-constrained hardware becomes feasible when δ remains small relative to n.
  • The method suggests possible extensions to dynamic or updateable indexes that also stay within compressed space bounds.

Load-bearing premise

The string complexity δ is known in advance and satisfies δ at least on the order of log log n.

What would settle it

An input string s with known δ where the construction space exceeds O(δ log n/δ) at any moment during the single pass or where a search for some pattern exceeds O(m + (occ + 1) log^ε n) time.

read the original abstract

Suppose that we are given a string $s$ of length $n$ over an alphabet $\{0,1,\ldots,n^{O(1)}\}$ and $\delta$ is the string complexity of $s$, a known compression measure. We describe an index on $s$ with $O(\delta\log\frac{n}{\delta})$ space, measured in $O(\log n)$-bit machine words, which can search in $s$ any string of length $m$ in $O(m + (\mathrm{occ} + 1)\log^\epsilon n)$ time, where $\mathrm{occ}$ is the number of occurrences and $\epsilon > 0$ is any fixed constant (the big-O in the space bound hides factor $\frac{1}{\epsilon}$). Crucially, the index can be built in $O(n\log n)$ expected time by one left-to-right pass on the string $s$ in a streaming fashion with $O(\delta\log\frac{n}{\delta})$ construction space. The index does not use the Karp--Rabin fingerprints, and the randomization in the construction time can be eliminated by using deterministic dictionaries instead of hash tables (with a slowdown). The search time matches currently best results and the space is almost optimal (the known optimum is $O(\delta\log\frac{n}{\delta\alpha})$, where $\alpha = \log_\sigma n$ and $\sigma$ is the alphabet size, and it coincides with $O(\delta\log\frac{n}{\delta})$ when $\delta = O(n / \alpha^2)$). This is the first index that can be constructed within such space and with such time guarantees. To avoid uninteresting marginal cases, all above bounds are stated for $\delta \ge \Omega(\log\log n)$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a compressed index for a string s of length n over a polynomial-sized alphabet, where δ denotes the string complexity of s (assumed known and at least Ω(log log n)). The index uses O(δ log(n/δ)) words of space, supports pattern matching queries for a string of length m in O(m + (occ + 1) log^ε n) time for any fixed ε > 0, and is constructed via a single left-to-right streaming pass in O(n log n) expected time using only O(δ log(n/δ)) space. The construction avoids Karp-Rabin fingerprints and can be made deterministic at the cost of a slowdown. The space is claimed to be nearly optimal, matching the known lower bound when δ = O(n / α²) for α = log_σ n.

Significance. If the stated bounds and streaming construction hold, the result is significant: it is the first index achieving simultaneous near-optimal compressed space, optimal query time, and true one-pass streaming construction in compressed space. The explicit asymptotic guarantees without additional free parameters (beyond the supplied δ and ε) and the avoidance of fingerprints strengthen the contribution. The work provides a concrete algorithmic advance in the area of compressed string indexes.

major comments (1)
  1. [Abstract] Abstract and construction description: the requirement that δ is known in advance is load-bearing for the one-pass streaming claim. All sampling rates, block sizes, and space allocations are functions of the supplied δ; an incorrect value would immediately violate the O(δ log(n/δ)) space bound or invalidate the query guarantees. The manuscript should explicitly state whether δ can be computed or approximated within the same streaming pass and space bound, or acknowledge this as a prerequisite limitation.
minor comments (2)
  1. [Abstract] The space bound hides a 1/ε factor; the construction section should clarify how the hidden constant depends on ε and whether it affects the streaming space allocation.
  2. [Abstract] Notation: define occ and ε at first use and ensure consistent use of big-O notation across the query and construction bounds.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and positive recommendation of our manuscript. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and construction description: the requirement that δ is known in advance is load-bearing for the one-pass streaming claim. All sampling rates, block sizes, and space allocations are functions of the supplied δ; an incorrect value would immediately violate the O(δ log(n/δ)) space bound or invalidate the query guarantees. The manuscript should explicitly state whether δ can be computed or approximated within the same streaming pass and space bound, or acknowledge this as a prerequisite limitation.

    Authors: We agree with the referee that the construction requires δ to be known in advance, as the algorithm's parameters are set based on this value. The paper presents δ as a known input measure, and our claims are conditional on this. Determining δ (or a suitable approximation) in a single streaming pass within the same space bound is not addressed in this work and remains an open problem. In the revised version, we will update the abstract and the construction description to explicitly acknowledge that δ is a prerequisite input to the algorithm, and that the streaming construction assumes it is supplied. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit algorithmic construction with stated assumptions

full rationale

The paper states an explicit one-pass streaming construction whose space O(δ log n/δ) and query time O(m + (occ+1) log^ε n) are expressed directly in terms of the externally supplied parameters n and δ (with δ ≥ Ω(log log n) given as a precondition). No equation or claim reduces a derived quantity to a fitted input, no self-citation is invoked as a uniqueness theorem, and no ansatz is smuggled via prior work. The requirement that δ be known in advance is an explicit modeling assumption that enables parameter setting; it does not create a definitional loop or force the claimed bounds by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard algorithmic assumptions in string data structures plus the existence of a new streaming construction; no new entities are postulated and no free parameters are fitted to data.

axioms (2)
  • standard math Alphabet size is n^{O(1)} and machine words are O(log n) bits.
    Explicitly stated in the abstract as the model for the input string s.
  • domain assumption Deterministic dictionaries can replace hash tables to remove randomization.
    Mentioned as an alternative that preserves the asymptotic bounds with a slowdown.

pith-pipeline@v0.9.0 · 5616 in / 1431 out tokens · 75267 ms · 2026-05-15T22:23:19.076465+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dynamic Grammar-Compressed Self-Index in $\delta$-Optimal Space

    cs.DS 2026-04 unverdicted novelty 8.0

    The dynamic RR-index is the first dynamic self-index to attain δ-optimal space, with locate in expected O(m + log m log² n + occ (log n / log log n)) time and updates in O(m' log² n + log³ n) time.