Output-Sensitive Construction of CDAWGs from BWT-Runs
Pith reviewed 2026-07-03 04:26 UTC · model grok-4.3
The pith
CDAWG of a string can be built from its BWT-runs in output-sensitive time and space using a given compressed suffix tree.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By exploiting these two views in opposite directions, we show how to build, for the (reversed) input string of length n, the CDAWG with e_L edges in O(e_L log n log(n/r)) time with O(r log(n/r)+e_L) words of working space, provided that the fully functional compressed suffix tree of Gagie, Navarro, and Prezza of size O(r log(n/r)) is available. Here, r denotes the number of BWT-runs of the input string.
What carries the argument
The pair of equivalent definitions of the CDAWG (edge-compacted DAWG versus suffix tree with merged isomorphic subtrees) used in opposite directions, together with the compressed suffix tree supplied as input.
If this is right
- Construction time scales with the number of CDAWG edges rather than string length.
- Working space is linear in the number of BWT-runs plus the final CDAWG size.
- The algorithm applies directly to the reversed string.
- All steps remain within the stated bounds once the compressed suffix tree is provided.
Where Pith is reading between the lines
- If the compressed suffix tree itself can be built from the BWT-runs in comparable time, the whole pipeline would run from the BWT alone.
- The same opposite-direction exploitation of the two CDAWG views might apply to related automata such as the suffix automaton.
- Strings with very small r and small e_L relative to n would see the largest practical speed-ups in memory-constrained settings.
Load-bearing premise
A fully functional compressed suffix tree of size O(r log(n/r)) is supplied as a ready black-box input.
What would settle it
A concrete string whose CDAWG has e_L edges yet requires either more than O(e_L log n log(n/r)) time or more than O(r log(n/r) + e_L) words to construct even when the compressed suffix tree is already given.
Figures
read the original abstract
The compact directed acyclic word graph (CDAWG) of a string can be viewed in two equivalent ways: as the edge-compacted DAWG of the string, and as the DAG obtained from the suffix tree by merging the nodes whose subtrees are isomorphic. By exploiting these two views in opposite directions, we show how to build, for the (reversed) input string of length $n$, the CDAWG with $e_L$ edges in $O(e_L\log n\log(n/r))$ time with $O(r\log(n/r)+e_L)$ words of working space, provided that the fully functional compressed suffix tree of Gagie, Navarro, and Prezza of size $O(r\log(n/r))$ is available. Here, $r$ denotes the number of BWT-runs of the input string.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims an algorithm to construct the CDAWG (with e_L edges) of the reversed input string of length n in O(e_L log n log(n/r)) time and O(r log(n/r) + e_L) words of space, assuming as a black-box input the fully functional compressed suffix tree of Gagie, Navarro, and Prezza (size O(r log(n/r))), where r is the number of BWT-runs.
Significance. If the claimed bounds and construction are correct, the result is significant for compressed string processing: it yields an output-sensitive CDAWG construction whose resources are expressed directly in terms of the output size e_L and the BWT-run count r, leveraging an existing CST as a modular black box. This is a strength when the CST is already available.
minor comments (1)
- [Abstract] Abstract: the time and space bounds are stated without any derivation, proof sketch, pseudocode, or high-level algorithmic outline, which prevents verification of the central claim from the provided text alone.
Simulated Author's Rebuttal
We thank the referee for reviewing our manuscript and for the positive assessment of its potential significance for compressed string processing, conditional on the correctness of the claimed bounds. The recommendation is listed as uncertain, yet the report contains no specific major comments or questions. We therefore provide no point-by-point responses below. Should the referee have particular concerns about the construction, the time or space analysis, or the use of the Gagie-Navarro-Prezza CST as a black box, we are happy to address them in a revised version or in a separate reply.
Circularity Check
No significant circularity; explicit external black-box assumption
full rationale
The paper's central claim is an output-sensitive algorithmic construction whose time and space bounds are stated directly in terms of the output size e_L and the size of an explicitly assumed external input (the Gagie-Navarro-Prezza CST of size O(r log(n/r))). No equations, reductions, or predictions are shown that collapse to the inputs by definition or by self-citation. The assumption is load-bearing but external and independent; the derivation therefore remains self-contained against the stated precondition.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The CDAWG admits two equivalent characterizations: edge-compacted DAWG and suffix tree with merged isomorphic subtrees.
- domain assumption A fully functional compressed suffix tree of size O(r log(n/r)) can be used as a black box.
Reference graph
Works this paper leans on
-
[1]
In: String Processing and Information Retrieval, SPIRE 2017
Belazzougui, D., Cunial, F.: Fast label extraction in the CDAWG. In: String Processing and Information Retrieval, SPIRE 2017. Lecture Notes in Computer Science, vol. 10508, pp. 161–175. Springer (2017).https://doi.org/10.1007/ 978-3-319-67428-5_14
2017
-
[3]
Bender, M.A., Farach-Colton, M.: The level ancestor problem simplified. Theor. Comput. Sci.321(1), 5–12 (2004).https://doi.org/10.1016/J.TCS.2003.05.002
-
[4]
Theoretical Computer Science40, 31–55 (1985).https://doi.org/10.1016/0304-3975(85)90157-4
Blumer, A., Blumer, J., Haussler, D., Ehrenfeucht, A., Chen, M.T., Seiferas, J.: The smallest automaton recognizing the subwords of a text. Theoretical Computer Science40, 31–55 (1985).https://doi.org/10.1016/0304-3975(85)90157-4
-
[5]
Journal of the ACM34(3), 578–595 (1987).https://doi.org/10.1145/28869.28873
Blumer, A., Blumer, J., Haussler, D., McConnell, R.M., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. Journal of the ACM34(3), 578–595 (1987).https://doi.org/10.1145/28869.28873
-
[6]
Journal of the ACM67(1), 2:1–2:54 (2020)
Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. Journal of the ACM67(1), 2:1–2:54 (2020). https://doi.org/10.1145/3375890
-
[7]
Kimura, K., I, T.: R-enum revisited: Speedup and extension for context-sensitive re- peats and net frequencies. In: CPM 2026. vol. 369, pp. 10:1–10:14. Schloss Dagstuhl– Leibniz-Zentrum für Informatik (2026).https://doi.org/10.4230/LIPIcs.CPM. 2026.10
-
[8]
Algorithmica79(2), 291–318 (2017).https://doi.org/10.1007/s00453-016-0178-z
Narisawa, K., Hiratsuka, H., Inenaga, S., Bannai, H., Takeda, M.: Efficient com- putation of substring equivalence classes with suffix arrays. Algorithmica79(2), 291–318 (2017).https://doi.org/10.1007/s00453-016-0178-z
-
[9]
In: 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021
Nishimoto, T., Tabei, Y.: R-enum: Enumeration of characteristic substrings in BWT-runs bounded space. In: 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021. Leibniz International Proceedings in Informatics, vol. 191, pp. 21:1–21:21. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2021).https: //doi.org/10.4230/LIPIcs.CPM.2021.21
-
[10]
In: 14th Annual Symposium on Switching and Automata Theory
Weiner, P.: Linear pattern matching algorithms. In: 14th Annual Symposium on Switching and Automata Theory. pp. 1–11. IEEE Computer Society (1973). https://doi.org/10.1109/SWAT.1973.13
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.