pith. sign in

arxiv: 2509.21586 · v2 · submitted 2025-09-25 · 💻 cs.CR

From Indexing to Coding: A New Paradigm for Data Availability Sampling

Pith reviewed 2026-05-18 13:29 UTC · model grok-4.3

classification 💻 cs.CR
keywords data availability samplingblockchainrandom linear network codinglight nodeserasure codesdata availabilityscalabilitycryptographic commitments
0
0 comments X

The pith

Committing to uncoded data and sampling via on-the-fly random linear coding lets light nodes verify data availability with up to orders of magnitude stronger statistical assurances than fixed-rate codes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a new paradigm for data availability sampling in blockchain systems by separating the commitment to the original data from the coding process. Instead of pre-committing to coded symbols using fixed-rate erasure codes like Reed-Solomon or LDPC, the method commits to the uncoded data and generates samples on the fly through random linear combinations. The resulting samples become significantly more expressive, which in concrete implementations gives light nodes much stronger assurances that the full data is available. A sympathetic reader cares because existing DAS approaches limit security and scalability for platforms like Ethereum, and this modular change addresses the data availability problem with minimal additional assumptions.

Core claim

The paper establishes a new approach to DAS that modularizes the coding and commitment process by committing to the uncoded data while performing sampling through on-the-fly coding. The protocol realizes this using random linear network coding. The resulting samples are significantly more expressive, enabling light nodes to obtain, in concrete implementations, up to multiple orders of magnitude stronger assurances of data availability than from sampling pre-committed symbols from a fixed-rate redundancy code as done in established DAS schemes using Reed Solomon or low density parity check codes.

What carries the argument

On-the-fly random linear network coding (RLNC) performed after committing to uncoded data, replacing sampling from a predetermined set of pre-coded symbols.

Load-bearing premise

The on-the-fly random linear combinations can be generated and verified with communication and computation costs that remain practical for light nodes while still delivering the claimed statistical advantage over fixed-rate codes.

What would settle it

A prototype implementation or simulation in which light-node verification of on-the-fly samples requires more samples or higher costs than Reed-Solomon DAS to reach equivalent assurance levels, or where costs exceed practical limits for resource-constrained nodes.

Figures

Figures reproduced from arXiv: 2509.21586 by Kishori Konwar, Moritz Grundei, Muriel Medard, Vipindev Adat Vasudevan.

Figure 1
Figure 1. Figure 1: Two sampling paradigms. Top: Sampling by indexing, where samples are [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: System actors in the data availability problem: The verifier tries to ascer [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Probability that withholding a sufficient amount of samples to make the [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Information exchange between the system actors in the DAS protocol [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of verifier download costs relative to the total amount of [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Parameters of a coded Merkle tree. At every layer, [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Two ways of committing 2D-RS coded data: Roots of Merkle trees (left) [PITH_FULL_IMAGE:figures/full_fig_p020_7.png] view at source ↗
read the original abstract

The data availability problem is a central challenge in blockchain systems and lies at the core of the accessibility and scalability issues faced by platforms such as Ethereum. Modern solutions employ several approaches, with data availability sampling (DAS) being the most self-sufficient and minimalistic in its security assumptions. Existing DAS methods typically form cryptographic commitments on codewords of fixed-rate erasure codes, which restrict light nodes to sampling from a predetermined set of coded symbols. In this paper, we introduce a new approach to DAS that modularizes the coding and commitment process by committing to the uncoded data while performing sampling through on-the-fly coding. The resulting samples are significantly more expressive, enabling light nodes to obtain, in concrete implementations, up to multiple orders of magnitude stronger assurances of data availability than from sampling pre-committed symbols from a fixed-rate redundancy code as done in established DAS schemes using Reed Solomon or low density parity check codes. We present a concrete protocol that realizes this paradigm using random linear network coding (RLNC).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a new paradigm for data availability sampling (DAS) in blockchain systems. Rather than forming cryptographic commitments on codewords of fixed-rate erasure codes (e.g., Reed-Solomon or LDPC), the approach commits directly to the uncoded data and performs sampling via on-the-fly random linear network coding (RLNC). The resulting samples are claimed to be significantly more expressive, enabling light nodes to obtain up to multiple orders of magnitude stronger statistical assurances of data availability than in established DAS schemes.

Significance. If the concrete RLNC protocol can be shown to deliver the claimed per-sample detection probability advantage while keeping verification and communication costs independent of block size and practical for light nodes, the work would represent a meaningful improvement in the security-efficiency tradeoff for blockchain light clients. The modular separation of commitment and coding is a clean conceptual contribution that could influence future DAS designs.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (protocol description): the central claim that on-the-fly RLNC yields 'up to multiple orders of magnitude stronger assurances' is stated without any quantitative comparison, detection-probability calculation, or security reduction. No table or equation shows the per-sample advantage over fixed-rate RS/LDPC symbols under realistic adversarial models.
  2. [§4] §4 (verification and cost analysis): the manuscript does not demonstrate that verification of a random linear combination (coefficients + result + proof) has communication and computation cost independent of the number of summed symbols or block dimension. If verification cost grows with the support size of the linear combination, the statistical gain is offset and the headline improvement does not materialize.
  3. [§5] §5 (security analysis): no concrete parameter choices, field size, or security proof are provided that would allow a reader to verify that the RLNC scheme achieves the stated advantage while remaining sound against a malicious block producer who controls the uncoded data.
minor comments (2)
  1. [§3] Notation for the commitment scheme and the RLNC coefficients should be introduced earlier and used consistently; several symbols appear without prior definition in the protocol section.
  2. [§2] The related-work discussion would benefit from explicit comparison of communication complexity per sample with existing DAS constructions (e.g., KZG-based or Merkle-tree DAS).

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments point by point below, indicating the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (protocol description): the central claim that on-the-fly RLNC yields 'up to multiple orders of magnitude stronger assurances' is stated without any quantitative comparison, detection-probability calculation, or security reduction. No table or equation shows the per-sample advantage over fixed-rate RS/LDPC symbols under realistic adversarial models.

    Authors: We agree with this observation. The current manuscript states the advantage qualitatively but lacks explicit calculations. In the revised version, we will add to §3 a quantitative analysis including equations for the detection probability of a single RLNC sample versus a fixed-rate symbol. Specifically, under an adversarial model where the adversary erases a fraction f of the data, the probability that a random linear combination detects the erasure is 1 - (1-f)^k where k is the number of symbols combined, leading to significantly higher detection rates for larger supports. We will include a table comparing these probabilities for typical DAS parameters (e.g., 1% vs 50% erasure rates) and discuss the orders-of-magnitude improvement in statistical assurance for light nodes. revision: yes

  2. Referee: [§4] §4 (verification and cost analysis): the manuscript does not demonstrate that verification of a random linear combination (coefficients + result + proof) has communication and computation cost independent of the number of summed symbols or block dimension. If verification cost grows with the support size of the linear combination, the statistical gain is offset and the headline improvement does not materialize.

    Authors: We appreciate this important point regarding practicality. In our design, the commitment is to the raw data using a vector commitment scheme that supports homomorphic linear combinations. Verification involves checking the proof for the combined value, which has constant size independent of the support size of the linear combination. The coefficients are generated pseudorandomly from a seed, so only the seed needs to be communicated, keeping communication cost low. We will expand §4 with a formal cost analysis table showing that both computation and communication remain independent of block dimension and support size, relying on the properties of the underlying commitment scheme. revision: yes

  3. Referee: [§5] §5 (security analysis): no concrete parameter choices, field size, or security proof are provided that would allow a reader to verify that the RLNC scheme achieves the stated advantage while remaining sound against a malicious block producer who controls the uncoded data.

    Authors: We acknowledge that the security analysis in §5 is currently at a high level. To address this, the revised manuscript will include concrete parameter recommendations, such as using a finite field of size 2^32 or larger to minimize collision probabilities, and a detailed security argument. We will provide a reduction showing that if the commitment scheme is binding and the RLNC is performed over a sufficiently large field, the probability of a malicious producer fooling the sampler is negligible. A full formal proof will be outlined in an appendix, with the main text providing the key steps and parameter choices that achieve the claimed advantage. revision: yes

Circularity Check

0 steps flagged

No circularity: paradigm shift rests on independent protocol construction

full rationale

The paper proposes committing to uncoded data and performing on-the-fly RLNC sampling, claiming stronger statistical assurances than fixed-rate RS/LDPC schemes. No equations, fitted parameters, or self-citations are exhibited that reduce the claimed orders-of-magnitude gain to a quantity defined by the authors' own prior work or by construction. The central advantage is presented as arising directly from the modular separation of commitment and coding, which is externally verifiable against standard erasure-code baselines without requiring the present paper's fitted values or uniqueness theorems. This is the expected non-finding for a conceptual protocol paper whose load-bearing step is the concrete RLNC verification cost, not a self-referential derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the protocol is described at the level of a high-level paradigm shift.

pith-pipeline@v0.9.0 · 5709 in / 1135 out tokens · 27279 ms · 2026-05-18T13:29:44.260980+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

  1. [1]

    In: 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton)

    Abdrashitov, V., M´ edard, M.: Durable network coded distributed storage. In: 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton). pp. 851–856. IEEE (2015)

  2. [2]

    Fraud and Data Availability Proofs: Maximising Light Client Security and Scaling Blockchains with Dishonest Majorities

    Al-Bassam, M., Sonnino, A., Buterin, V.: Fraud and data availability proofs: Max- imising light client security and scaling blockchains with dishonest majorities. arXiv preprint arXiv:1809.09044 (2018)

  3. [3]

    Cryptology ePrint Archive (2025)

    Boneh, D., Neu, J., Nikolaenko, V., Partap, A.: Data availability sampling with repair. Cryptology ePrint Archive (2025)

  4. [4]

    In: 2018 IEEE symposium on security and privacy (SP)

    B¨ unz, B., Bootle, J., Boneh, D., Poelstra, A., Wuille, P., Maxwell, G.: Bulletproofs: Short proofs for confidential transactions and more. In: 2018 IEEE symposium on security and privacy (SP). pp. 315–334. IEEE (2018)

  5. [5]

    In: International conference on the theory and applica- tion of cryptology and information security

    B¨ unz, B., Maller, M., Mishra, P., Tyagi, N., Vesely, P.: Proofs for inner pairing products and applications. In: International conference on the theory and applica- tion of cryptology and information security. pp. 65–97. Springer (2021)

  6. [6]

    In: Workshop on Network Coding, Theory and Applications (2005) From Indexing to Coding: A New Paradigm for Data Availability Sampling 15

    Deb, S.: How good is random linear coding based distributed networked storage. In: Workshop on Network Coding, Theory and Applications (2005) From Indexing to Coding: A New Paradigm for Data Availability Sampling 15

  7. [7]

    Cryptology ePrint Archive (2023)

    Hall-Andersen, M., Simkin, M., Wagner, B.: Foundations of data availability sam- pling. Cryptology ePrint Archive (2023)

  8. [8]

    IEEE Transactions on information theory52(10), 4413–4430 (2006)

    Ho, T., M´ edard, M., Koetter, R., Karger, D.R., Effros, M., Shi, J., Leong, B.: A random linear network coding approach to multicast. IEEE Transactions on information theory52(10), 4413–4430 (2006)

  9. [9]

    In: International conference on the theory and appli- cation of cryptology and information security

    Kate, A., Zaverucha, G.M., Goldberg, I.: Constant-size commitments to polyno- mials and their applications. In: International conference on the theory and appli- cation of cryptology and information security. pp. 177–194. Springer (2010)

  10. [10]

    John Wiley & Sons (2025)

    M´ edard, M., Vasudevan, V.A., Pedersen, M.V., Duffy, K.R.: Network Coding for Engineers. John Wiley & Sons (2025)

  11. [11]

    IEEE Transac- tions on Communications70(9), 5742–5759 (2022)

    Mitra, D., Tauz, L., Dolecek, L.: Overcoming data availability attacks in blockchain systems: Short code-length ldpc code design for coded merkle tree. IEEE Transac- tions on Communications70(9), 5742–5759 (2022)

  12. [12]

    Technical re- port, Paradigm (2022)

    Neu, J.: Data availability sampling: From basics to open problems. Technical re- port, Paradigm (2022)

  13. [13]

    Nikolaenko, V., Boneh, D.: Data availability sampling and danksharding: An overview and a proposal for improvements (2023)

  14. [14]

    Cambridge University Press (2006)

    Roth, R.: Introduction to coding theory. Cambridge University Press (2006)

  15. [15]

    Cryptology ePrint Archive (2024)

    Wagner, B., Zapico, A.: A documentation of ethereum’s peerdas. Cryptology ePrint Archive (2024)

  16. [16]

    Yu, M., Kannan, S., Viswanath, P., Li, S., Avestimehr, A.S.: Determining data availability,https://patents.google.com/patent/US11533127B1/en

  17. [17]

    Yu, M., Kannan, S., Viswanath, P., Li, S., Avestimehr, A.S.: Verifying a set of remotely stored data,https://patentimages.storage.googleapis.com/f4/3f/ e9/6fb900187dc974/US11356274.pdf

  18. [18]

    In: International Con- ference on Financial Cryptography and Data Security

    Yu, M., Sahraei, S., Li, S., Avestimehr, S., Kannan, S., Viswanath, P.: Coded merkle tree: Solving data availability attacks in blockchains. In: International Con- ference on Financial Cryptography and Data Security. pp. 114–134. Springer (2020) A Proof of Soundness Theorem For soundness, some data entity must be reconstructible from the emitted coded vec...

  19. [19]

    These are usually denoted byc(variable-node degree) andd(check-node degree)

    In a regular LDPC code, the variable node degree (the number of parity- check equations in which each coded symbol, or variable node, participates) is constant across all variable nodes, and the check node degree (the number of coded symbols connected to each parity-check equation) is likewise constant. These are usually denoted byc(variable-node degree) ...

  20. [20]

    Such codes are typically characterized by their average variable-node degree and check-node degree

    In an irregular LDPC code, the variable-node and check-node degrees are not constant but instead follow a degree distribution. Such codes are typically characterized by their average variable-node degree and check-node degree. The LDPC codes used in [11, 18] are a special type of systematic, irregular LDPC codes with rater= 0.25, meaning that the firstkco...

  21. [21]

    Different approaches exist for chunking the original data into shares, encod- ing it, and batching it again

    first introduced RS codes as component codes of a 2D tensor code for DAS. Different approaches exist for chunking the original data into shares, encod- ing it, and batching it again. In what follows, we describe a generalized version of 2D-RS-based DAS schemes that captures the key properties of established methods, particularly with respect to sampling c...