pith. sign in

arxiv: 2607.01636 · v1 · pith:5EK2MTGKnew · submitted 2026-07-02 · 💻 cs.DS

Output-Sensitive Construction of CDAWGs from BWT-Runs

Pith reviewed 2026-07-03 04:26 UTC · model grok-4.3

classification 💻 cs.DS
keywords CDAWGBWT-runscompressed suffix treeoutput-sensitive constructionstring algorithmsdirected acyclic word graphBurrows-Wheeler transformsuffix tree
0
0 comments X

The pith

CDAWG of a string can be built from its BWT-runs in output-sensitive time and space using a given compressed suffix tree.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an algorithm that constructs the compact directed acyclic word graph (CDAWG) of a reversed input string of length n. The CDAWG has e_L edges and is built in O(e_L log n log(n/r)) time using O(r log(n/r) + e_L) words of space, where r is the number of runs in the Burrows-Wheeler transform. The construction takes a fully functional compressed suffix tree of size O(r log(n/r)) as a black-box input. It works by treating the CDAWG in two ways at once: as an edge-compacted version of the DAWG and as a suffix tree whose nodes with isomorphic subtrees have been merged. A reader would care because the time and space depend on the output size and the compressibility measured by r rather than on n directly.

Core claim

By exploiting these two views in opposite directions, we show how to build, for the (reversed) input string of length n, the CDAWG with e_L edges in O(e_L log n log(n/r)) time with O(r log(n/r)+e_L) words of working space, provided that the fully functional compressed suffix tree of Gagie, Navarro, and Prezza of size O(r log(n/r)) is available. Here, r denotes the number of BWT-runs of the input string.

What carries the argument

The pair of equivalent definitions of the CDAWG (edge-compacted DAWG versus suffix tree with merged isomorphic subtrees) used in opposite directions, together with the compressed suffix tree supplied as input.

If this is right

  • Construction time scales with the number of CDAWG edges rather than string length.
  • Working space is linear in the number of BWT-runs plus the final CDAWG size.
  • The algorithm applies directly to the reversed string.
  • All steps remain within the stated bounds once the compressed suffix tree is provided.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the compressed suffix tree itself can be built from the BWT-runs in comparable time, the whole pipeline would run from the BWT alone.
  • The same opposite-direction exploitation of the two CDAWG views might apply to related automata such as the suffix automaton.
  • Strings with very small r and small e_L relative to n would see the largest practical speed-ups in memory-constrained settings.

Load-bearing premise

A fully functional compressed suffix tree of size O(r log(n/r)) is supplied as a ready black-box input.

What would settle it

A concrete string whose CDAWG has e_L edges yet requires either more than O(e_L log n log(n/r)) time or more than O(r log(n/r) + e_L) words to construct even when the compressed suffix tree is already given.

Figures

Figures reproduced from arXiv: 2607.01636 by Hiroki Arimura, Shunsuke Inenaga, Yuta Tsuruzono.

Figure 1
Figure 1. Figure 1: Illustration of the interval computation for a soft Weiner link from a maximal repeat w. The link by a first reaches the locus of aw, and its right closure is q = −→aw with range(aw) = range(q) = [L, R]. The maximal-repeat node h = ←−q = ←→aw is found by the left-closure search. Along the valid part of this search, the SA-interval width remains R − L + 1; after the maximal left growth, the next candidate l… view at source ↗
read the original abstract

The compact directed acyclic word graph (CDAWG) of a string can be viewed in two equivalent ways: as the edge-compacted DAWG of the string, and as the DAG obtained from the suffix tree by merging the nodes whose subtrees are isomorphic. By exploiting these two views in opposite directions, we show how to build, for the (reversed) input string of length $n$, the CDAWG with $e_L$ edges in $O(e_L\log n\log(n/r))$ time with $O(r\log(n/r)+e_L)$ words of working space, provided that the fully functional compressed suffix tree of Gagie, Navarro, and Prezza of size $O(r\log(n/r))$ is available. Here, $r$ denotes the number of BWT-runs of the input string.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 1 minor

Summary. The paper claims an algorithm to construct the CDAWG (with e_L edges) of the reversed input string of length n in O(e_L log n log(n/r)) time and O(r log(n/r) + e_L) words of space, assuming as a black-box input the fully functional compressed suffix tree of Gagie, Navarro, and Prezza (size O(r log(n/r))), where r is the number of BWT-runs.

Significance. If the claimed bounds and construction are correct, the result is significant for compressed string processing: it yields an output-sensitive CDAWG construction whose resources are expressed directly in terms of the output size e_L and the BWT-run count r, leveraging an existing CST as a modular black box. This is a strength when the CST is already available.

minor comments (1)
  1. [Abstract] Abstract: the time and space bounds are stated without any derivation, proof sketch, pseudocode, or high-level algorithmic outline, which prevents verification of the central claim from the provided text alone.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for reviewing our manuscript and for the positive assessment of its potential significance for compressed string processing, conditional on the correctness of the claimed bounds. The recommendation is listed as uncertain, yet the report contains no specific major comments or questions. We therefore provide no point-by-point responses below. Should the referee have particular concerns about the construction, the time or space analysis, or the use of the Gagie-Navarro-Prezza CST as a black box, we are happy to address them in a revised version or in a separate reply.

Circularity Check

0 steps flagged

No significant circularity; explicit external black-box assumption

full rationale

The paper's central claim is an output-sensitive algorithmic construction whose time and space bounds are stated directly in terms of the output size e_L and the size of an explicitly assumed external input (the Gagie-Navarro-Prezza CST of size O(r log(n/r))). No equations, reductions, or predictions are shown that collapse to the inputs by definition or by self-citation. The assumption is load-bearing but external and independent; the derivation therefore remains self-contained against the stated precondition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the equivalence of the two CDAWG views and the availability of the cited compressed suffix tree; these are domain assumptions rather than new entities or fitted parameters.

axioms (2)
  • domain assumption The CDAWG admits two equivalent characterizations: edge-compacted DAWG and suffix tree with merged isomorphic subtrees.
    Invoked in the first sentence of the abstract as the basis for exploiting the views in opposite directions.
  • domain assumption A fully functional compressed suffix tree of size O(r log(n/r)) can be used as a black box.
    Explicitly stated as a prerequisite in the complexity claim.

pith-pipeline@v0.9.1-grok · 5675 in / 1515 out tokens · 34707 ms · 2026-07-03T04:26:02.595563+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

9 extracted references · 8 canonical work pages

  1. [1]

    In: String Processing and Information Retrieval, SPIRE 2017

    Belazzougui, D., Cunial, F.: Fast label extraction in the CDAWG. In: String Processing and Information Retrieval, SPIRE 2017. Lecture Notes in Computer Science, vol. 10508, pp. 161–175. Springer (2017).https://doi.org/10.1007/ 978-3-319-67428-5_14

  2. [3]

    Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci.321(1), 5–12 (2004).https://doi.org/10.1016/J.TCS.2003.05.002

  3. [4]

    Theoretical Computer Science40, 31–55 (1985).https://doi.org/10.1016/0304-3975(85)90157-4

    Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoretical Computer Science40, 31–55 (1985).https://doi.org/10.1016/0304-3975(85)90157-4

  4. [5]

    Journal of the ACM34(3), 578–595 (1987).https://doi.org/10.1145/28869.28873

    Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. Journal of the ACM34(3), 578–595 (1987).https://doi.org/10.1145/28869.28873

  5. [6]

    Journal of the ACM67(1), 2:1–2:54 (2020)

    Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM67(1), 2:1–2:54 (2020). https://doi.org/10.1145/3375890

  6. [7]

    In: CPM 2026

    Kimura, K., I, T.: R-enum revisited: Speedup and extension for context-sensitive re- peats and net frequencies. In: CPM 2026. vol. 369, pp. 10:1–10:14. Schloss Dagstuhl– Leibniz-Zentrum für Informatik (2026).https://doi.org/10.4230/LIPIcs.CPM. 2026.10

  7. [8]

    Algorithmica79(2), 291–318 (2017).https://doi.org/10.1007/s00453-016-0178-z

    Narisawa, K., Hiratsuka, H., Inenaga, S., Bannai, H., Takeda, M.: Efficient com- putation of substring equivalence classes with suffix arrays. Algorithmica79(2), 291–318 (2017).https://doi.org/10.1007/s00453-016-0178-z

  8. [9]

    In: 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021

    Nishimoto, T., Tabei, Y.: R-enum: Enumeration of characteristic substrings in BWT-runs bounded space. In: 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021. Leibniz International Proceedings in Informatics, vol. 191, pp. 21:1–21:21. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2021).https: //doi.org/10.4230/LIPIcs.CPM.2021.21

  9. [10]

    In: 14th Annual Symposium on Switching and Automata Theory

    Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory. pp. 1–11. IEEE Computer Society (1973). https://doi.org/10.1109/SWAT.1973.13