pith. sign in

arxiv: 2601.06493 · v2 · submitted 2026-01-10 · 💻 cs.IT · math.CO· math.IT

On the Number of Subsequences in the Nonbinary Deletion Channel

Pith reviewed 2026-05-16 15:47 UTC · model grok-4.3

classification 💻 cs.IT math.COmath.IT
keywords subsequencesdeletion channelnon-binary stringsrunsmaximum numberpolynomial timeinformation theory
0
0 comments X

The pith

A family of r-run non-binary strings maximizes subsequences after t deletions, with the count computable in polynomial time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the number of subsequences that can be obtained from non-binary strings after undergoing t deletions. It concentrates on strings consisting of exactly r runs of identical symbols. The authors identify a specific family of these r-run strings that yields the highest possible number of subsequences for any given t. Additionally, they provide a method to calculate this maximum value efficiently in polynomial time.

Core claim

We characterize a family of r-run non-binary strings with the maximum number of subsequences under any t deletions, and show that this number can be computed in polynomial time.

What carries the argument

The family of r-run non-binary strings characterized by run-count properties that achieve the maximum subsequence count after deletions.

If this is right

  • The maximum number of subsequences for any r-run string under t deletions is given by this family.
  • This maximum can be found using an efficient polynomial-time procedure.
  • Improved bounds on subsequence counts are obtained for non-binary r-run strings.
  • These results apply directly to analyzing the deletion channel for non-binary alphabets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This characterization may aid in constructing error-correcting codes optimized for deletion channels with larger alphabets.
  • Similar run-based maximization might be explored for other channel models or error types.
  • The polynomial-time computability suggests practical applications in sequence analysis for large parameters.

Load-bearing premise

The family achieving the maximum subsequence count depends only on the number of runs and can be identified without regard to particular symbol assignments or alphabet size variations.

What would settle it

An r-run non-binary string outside the characterized family that has more subsequences than the maximum claimed for some specific t would disprove the result.

Figures

Figures reproduced from arXiv: 2601.06493 by Fang-Wei Fu, Han Li, Xiang Wang.

Figure 1
Figure 1. Figure 1: Comparison of upper bounds for the case q = 3, n = 120, r = 24. Our upper bounds derived from balanced strings [Corollary 24] are compared against previous best known bounds. [L] marks the upper bound proven by Levenstein [9]. [HR] marks the upper bound proven by Hirschberg and Regnier [1]. The results are presented as functions of t on a logarithmic scale. for all possible sequences Z (at most n−t+q−2 q−2… view at source ↗
read the original abstract

In the deletion channel, an important problem is to determine the number of subsequences derived from a string $U$ of length $n$ when subjected to $t$ deletions. It is well-known that the number of subsequences in the setting exhibits a strong dependence on the number of runs in the string $U$, where a run is defined as a maximal substring of identical characters. In this paper we study the number of subsequences of a non-binary string in this scenario, and propose some improved bounds on the number of subsequences of $r$-run non-binary strings. Specifically, we characterize a family of $r$-run non-binary strings with the maximum number of subsequences under any $t$ deletions, and show that this number can be computed in polynomial time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper studies the number of distinct subsequences of length n-t obtained from an r-run non-binary string after t deletions. It characterizes a specific family of r-run strings that achieve the maximum number of such subsequences for any t, proposes improved bounds on this quantity, and shows that the maximum can be computed in polynomial time from the run-length properties.

Significance. If the characterization is correct, the work supplies a precise combinatorial description of subsequence-maximizing strings in the non-binary deletion channel and an efficient algorithm for the maximum count. This strengthens the known dependence of subsequence numbers on run structure and could inform code design or reconstruction algorithms that exploit run-length statistics. The polynomial-time claim is a concrete algorithmic contribution.

major comments (2)
  1. [Abstract] Abstract and characterization: the claim that the maximum depends only on the run-length vector assumes an alphabet large enough to assign distinct symbols to all r runs. For |Σ| < r, non-adjacent runs may receive the same symbol, causing effective merging after deletions that skip intervening runs and strictly lowering the subsequence count relative to the claimed maximizer. This assumption is load-bearing for both the characterization and the polynomial-time result.
  2. [Characterization] Characterization and complexity sections: the polynomial-time algorithm is asserted to compute the maximum from run properties alone, but without an explicit statement of how symbol collisions are avoided or bounded, it is unclear whether the DP or recurrence remains valid for arbitrary alphabet sizes. A concrete counter-example or proof that the optimum is always attained with distinct symbols (or a correction for small alphabets) is needed.
minor comments (1)
  1. [Abstract] The abstract states that improved bounds are proposed but does not indicate whether they follow directly from the characterization or require separate arguments; a short remark linking the two would clarify the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. The points raised correctly identify that our characterization and algorithmic result rely on an alphabet large enough to assign distinct symbols to runs. We will revise the manuscript to state this assumption explicitly and add supporting discussion. Point-by-point responses follow.

read point-by-point responses
  1. Referee: [Abstract] Abstract and characterization: the claim that the maximum depends only on the run-length vector assumes an alphabet large enough to assign distinct symbols to all r runs. For |Σ| < r, non-adjacent runs may receive the same symbol, causing effective merging after deletions that skip intervening runs and strictly lowering the subsequence count relative to the claimed maximizer. This assumption is load-bearing for both the characterization and the polynomial-time result.

    Authors: We agree that the result assumes |Σ| ≥ r so that distinct symbols can be assigned to the r runs. This is the natural setting for maximizing the number of subsequences, as any symbol repetition between non-adjacent runs reduces the count by creating effective merges after deletions. We will revise the abstract and introduction to state the assumption explicitly: “for an alphabet of size at least r.” We will also add a brief remark noting that for |Σ| < r the maximum is strictly smaller and depends on the optimal symbol assignment. revision: yes

  2. Referee: [Characterization] Characterization and complexity sections: the polynomial-time algorithm is asserted to compute the maximum from run properties alone, but without an explicit statement of how symbol collisions are avoided or bounded, it is unclear whether the DP or recurrence remains valid for arbitrary alphabet sizes. A concrete counter-example or proof that the optimum is always attained with distinct symbols (or a correction for small alphabets) is needed.

    Authors: The DP and recurrence compute the maximum under the distinct-symbol assignment, which is always optimal when |Σ| ≥ r. We will insert a short lemma proving that any optimal string can be relabeled with distinct symbols without decreasing the subsequence count (by the pigeonhole principle and the effect of deletions on run merging). For |Σ| < r we will note that the algorithm requires an additional optimization step over symbol assignments; a small counter-example (binary alphabet, r=3) will be added to illustrate the gap. These clarifications will be placed in the characterization section. revision: yes

Circularity Check

0 steps flagged

No circularity: direct combinatorial characterization from run-length properties

full rationale

The paper's central result is a combinatorial characterization of the r-run non-binary strings maximizing the number of length-(n-t) subsequences for any t, together with a polynomial-time algorithm to compute that number. No equation or definition in the provided abstract reduces the claimed maximum to a fitted parameter, a self-referential quantity, or a result imported solely via self-citation. The derivation is presented as proceeding from the standard run-length decomposition of strings and explicit counting arguments over deletion patterns; these inputs are independent of the final characterization. The skeptic concern about symbol-assignment constraints is a question of proof completeness rather than circularity, as the paper does not define the maximum in terms of itself or rename a fitted quantity as a prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the standard combinatorial fact that subsequence counts in deletion channels are governed by run structure; no free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption The number of subsequences exhibits a strong dependence on the number of runs in the string U.
    Explicitly stated as well-known in the abstract.

pith-pipeline@v0.9.0 · 5429 in / 1155 out tokens · 46094 ms · 2026-05-16T15:47:13.612477+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

  1. [1]

    Tight bounds on the number of string subsequences,

    D. S. Hirschberg and M. Regnier, “Tight bounds on the number of string subsequences,”J. Discrete Algorithms, vol. 1, no. 1, pp. 123-132, 2000

  2. [2]

    Sequence Reconstruction over 3-Deletion Channels,

    D. Zhang, G. Ge, and Y . Zhang, “Sequence Reconstruction over 3-Deletion Channels,”InProc. Int. Symp. Inform. Theory, pp. 891-896, 2024

  3. [3]

    On the number of subsequences when deleting symbols from a string,

    H. Mercier, M. Khabbazian, and V . K. Bhargava, “On the number of subsequences when deleting symbols from a string,”IEEE Trans. Inf. Theory, vol. 54, no. 7, pp. 3279-3285, Jul. 2008

  4. [4]

    A survey of error-correcting codes for channels with symbol synchronization errors,

    H. Mercier, V . K. Bhargava, and V . Tarokh, “A survey of error-correcting codes for channels with symbol synchronization errors,”IEEE Commun. Surveys Tuts., vol. 12, no. 1, pp. 87-96, Feb. 2010

  5. [5]

    On the zero-error capacity threshold for deletion channels,

    I. A. Kash, M. Mitzenmacher, J. Thaler, and J. Ullman, “On the zero-error capacity threshold for deletion channels,” inProc. Inf. Theory Appl. Workshop (ITA), 2011, pp. 1-5

  6. [6]

    Some general results of coding theory with applications to the study of codes for the correction of synchronization errors,

    L. Calabi and W. E. Hartnett, “Some general results of coding theory with applications to the study of codes for the correction of synchronization errors,” Inf. Control, vol. 15, no. 3, pp. 235-249, Sep. 1969

  7. [7]

    A survey of results for deletion channels and related synchronization channels,

    M. Mitzenmacher, “A survey of results for deletion channels and related synchronization channels,”Probab. Surv., vol. 6, pp. 1-33, 2009

  8. [8]

    Sequence reconstruction over the deletion channel,

    R. Gabrys and E. Yaakobi, “Sequence reconstruction over the deletion channel,”IEEE Trans. Inf. Theory, vol. 64, no. 4, pp. 2924-2931, 2018

  9. [9]

    Binary codes capable of correcting deletions, insertions, and reversals,

    V . I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,”Soviet Phys. Doklady, vol. 10, no. 8, pp. 707-710, 1966

  10. [10]

    Efficient reconstruction of sequences from their subsequences or supersequences,

    V . I. Levenshtein, “Efficient reconstruction of sequences from their subsequences or supersequences,”Journal of Combin. Theory, Ser. A, vol. 93, no.2, pp. 310-332, 2001

  11. [11]

    Efficient reconstruction of sequences,

    V . I. Levenshtein, “Efficient reconstruction of sequences,”IEEE Trans. Inf. Theory, vol. 47, no. 1, pp. 2-22, 2001

  12. [12]

    Sequence reconstruction problem for deletion channels: a complete asymptotic solution,

    V . L. P. Pham, K. Goyal, and H. M. Kiah, “Sequence reconstruction problem for deletion channels: a complete asymptotic solution,” InProc. IEEE Int. Symp. Inform. Theory, pp. 992-997, 2022

  13. [13]

    Sequence reconstruction problem for deletion channels: a complete asymptotic solution,

    V . L. P. Pham, K. Goyal, and H. M. Kiah, “Sequence reconstruction problem for deletion channels: a complete asymptotic solution,”Journal of Combin. Theory, Ser. A, https://doi.org/10.1016/j.jcta.2024.105980, 2025

  14. [14]

    A characterization of the number of subsequences obtained via the deletion channel,

    Y . Liron and M. Langberg, “A characterization of the number of subsequences obtained via the deletion channel,”IEEE Trans. Inf. Theory, vol. 61, no. 5, pp. 2300-2312, 2015

  15. [15]

    Sequence reconstruction under single-burst insertion/deletion/edit channel,

    Y . Sun, Y . Xi, and G. Ge, “Sequence reconstruction under single-burst insertion/deletion/edit channel,”IEEE Trans. Inf. Theory, vol. 69, no. 7, pp. 4466-4483, 2023

  16. [16]

    On the size distribution of the fixed-length Levenshtein balls with radius one,

    G. Wang and Q. Wang, “On the size distribution of the fixed-length Levenshtein balls with radius one,”Designs, Codes and Cryptography, vol. 92, pp. 2253-2265, 2024

  17. [17]

    Sequence reconstruction under channels with multiple bursts of insertions or deletions,

    Z. Lan, Y . Sun, W. Yu and G. Ge, “Sequence reconstruction under channels with multiple bursts of insertions or deletions,”IEEE Trans. Inf. Theory, vol. 72, no. 1, pp. 315-330, 2026

  18. [18]

    On the size of balls and anticodes of small diameter under the fixed-length Levenshtein metric,

    D. Bar-Lev, T. Etzion, and E. Yaakobi, “On the size of balls and anticodes of small diameter under the fixed-length Levenshtein metric,”IEEE Trans. Inf. Theory, vol. 69, no. 4, pp. 2324-2340, 2023

  19. [19]

    The size of Levenshtein ball with radius 2: expectation and concentration bound,

    L. He and M. Ye, “The size of Levenshtein ball with radius 2: expectation and concentration bound,” in Proceeding of theInternational Symposium on Information Theory (ISIT), Taipei, Taiwan, 2023, pp. 850-855