arxiv: 2604.17828 · v1 · submitted 2026-04-20 · 💻 cs.CL

How Non-Linguistic Is the Indus Sign System? A Synthetic-Baseline Scorecard

Pith reviewed 2026-05-10 04:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords Indus script, non-linguistic sign systems, synthetic baselines, statistical metrics, Zipf distribution, conditional entropy, ancient inscriptions, multi-metric analysis

The pith

The Indus sign system occupies an intermediate statistical position unmatched by either heraldic or administrative non-linguistic baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a scorecard that pits the Indus inscriptions against two families of computer-generated non-linguistic texts, one styled after heraldic emblems and the other after administrative codes. Both families are calibrated with Zipfian frequency distributions, positional constraints, and bigram dependencies drawn from six attested non-linguistic corpora. Four properties are measured on 1,916 deduplicated Indus inscriptions: text brevity, formulaic repetition, hapax rate, and positional rigidity. The Indus data lands between the two synthetic families across these metrics, and none of the seven real non-linguistic systems examined reproduces its full profile either. The paper also replicates earlier results for the corpus: a Zipf slope of -1.49 and a conditional entropy of 3.23 bits.
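As a point of reference for the Zipf slope quoted above, the sketch below fits an ordinary least-squares line to log frequency against log rank, which is how the caption of Figure 2 describes the fit. The function name `zipf_slope` and the list-of-sign-lists input format are assumptions made here for illustration; the paper's own code may truncate, bin, or weight the ranks differently.

```python
import math
from collections import Counter


def zipf_slope(inscriptions):
    """OLS slope of log10(frequency) on log10(rank) over all sign types.

    `inscriptions` is assumed to be an iterable of sign sequences,
    e.g. [["S1", "S2"], ["S2", "S3", "S1"]]. Hypothetical format.
    """
    counts = Counter(sign for text in inscriptions for sign in text)
    freqs = sorted(counts.values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct signs to fit a slope")

    # Rank 1 is the most frequent sign.
    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(freq) for freq in freqs]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

A slope near -1 is the classic Zipfian case; a steeper value such as -1.49 means frequency falls off faster with rank.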

Core claim

The Indus corpus does not match either synthetic baseline cleanly and occupies an intermediate position relative to the heraldic and administrative families across the four metrics; neither generator family can reproduce all four properties simultaneously, and no attested real-world non-linguistic system matches the complete statistical profile either.

What carries the argument

A multi-metric discrimination framework that generates synthetic non-linguistic baselines calibrated with Zipfian frequencies, positional constraints, and bigram dependencies from attested corpora, then scores both the Indus inscriptions and the baselines on brevity, formulaic repetition, hapax rate, and positional rigidity.
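Two of the four scorecard properties have simple textbook forms, sketched below. This is a minimal illustration, not the paper's implementation: it assumes brevity is summarised by mean inscription length and that hapax rate is the share of sign types occurring exactly once (some authors divide by tokens instead). The function names and input format are hypothetical. Formulaic repetition and positional rigidity are left out because this summary does not give their definitions. As a sanity check on the first measure, the corpus counts quoted in the abstract (11,110 tokens over 1,916 inscriptions) imply a mean length of roughly 5.8 signs.

```python
from collections import Counter


def mean_length(inscriptions):
    """Average number of signs per inscription (assumed brevity measure)."""
    texts = list(inscriptions)
    if not texts:
        raise ValueError("empty corpus")
    return sum(len(text) for text in texts) / len(texts)


def hapax_rate(inscriptions):
    """Fraction of sign types that occur exactly once in the corpus.

    Assumption: the denominator is the number of distinct signs (types),
    not the number of tokens.
    """
    counts = Counter(sign for text in inscriptions for sign in text)
    if not counts:
        raise ValueError("corpus contains no signs")
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(counts)
```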

If this is right

  • Non-linguistic models must be more complex than pure emblem or coding systems to account for all four Indus properties at once.
  • Single-metric arguments about brevity or repetition alone are insufficient to settle the linguistic status of the Indus signs.
  • The previously reported Zipf slope (-1.49) and conditional entropy (3.23 bits) are replicated alongside the new multi-metric test.
  • Any future non-linguistic explanation must be checked against the full set of four metrics rather than selected subsets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The intermediate profile raises the possibility that the Indus system combines features from multiple generative mechanisms not captured by the current baselines.
  • Applying the same scorecard to other short, formulaic ancient scripts could test whether they fall into similar intermediate zones.
  • If additional metrics such as sign co-occurrence entropy or directionality constraints were added, the distance between Indus data and non-linguistic baselines might increase or shrink in measurable ways.

Load-bearing premise

That the two synthetic generator families together with the seven real non-linguistic corpora cover the space of possible non-linguistic sign systems and that the four chosen metrics are jointly sufficient to detect linguistic structure.

What would settle it

Discovery of any real or newly engineered non-linguistic sign system that simultaneously reproduces the Indus values for text brevity, formulaic repetition, hapax rate, and positional rigidity would falsify the intermediate-position claim.
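One way to make this falsification test concrete is a simultaneous-match check over all four metrics. The sketch below is an assumption-heavy illustration: the paper, as summarised here, does not state a numeric matching criterion, so the relative tolerance `rel_tol`, the function name `matches_profile`, and the dictionary layout are all hypothetical.

```python
def matches_profile(candidate, target, rel_tol=0.1):
    """Return True only if the candidate matches every target metric.

    `candidate` and `target` map metric names to numeric values.
    A metric matches when the absolute difference is within `rel_tol`
    times the magnitude of the target value (an assumed criterion).
    A metric missing from the candidate counts as a mismatch.
    """
    for name, target_value in target.items():
        if name not in candidate:
            return False
        allowed = rel_tol * abs(target_value)
        if abs(candidate[name] - target_value) > allowed:
            return False
    return True
```

The check is deliberately conjunctive: a system that matches three metrics and misses the fourth does not count, which mirrors the claim that no baseline reproduces all four properties at once.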

Figures

Figures reproduced from arXiv: 2604.17828 by Ashish Nair.

Figure 1: Distribution of inscription lengths across the corpus (n=2,511). Dashed red line indicates mean
Figure 2: Zipf’s Law in the Indus sign system. Log-log plot of sign frequency vs. rank with OLS fit
Figure 3: Conditional entropy of the Indus corpus (red dashed line, 3.232 bits) vs. the within-inscription
Figure 4: FSW Scorecard comparing the Indus corpus (blue) against heraldic (orange) and administrative

Original abstract

Whether the Indus Valley sign system (c. 2600-1900 BCE) encodes spoken language has been debated for decades. This paper introduces a multi-metric discrimination framework that tests the observed Indus corpus against two kinds of computer-generated non-linguistic baseline -- one mimicking a heraldic emblem system, the other an administrative coding system -- each calibrated with Zipfian frequency distributions, positional constraints, and bigram dependencies derived from six attested non-linguistic corpora. The scorecard evaluates four properties central to the Farmer-Sproat-Witzel (2004) critique: text brevity, repeated formulaic phrases, hapax legomenon rate, and positional rigidity. Applying this framework to 1,916 deduplicated inscriptions (584 unique signs, 11,110 tokens) from the ICIT/Yajnadevam digitization, we find that the Indus corpus does not match either baseline cleanly. Across the four metrics examined, the Indus corpus occupies an intermediate position relative to the two baseline families, matching neither cleanly. Neither a heraldic nor an administrative generator can reproduce all four properties at once. We also compare against seven real-world non-linguistic corpora including Sproat's (2014) datasets, finding that no attested non-linguistic system reproduces the full Indus statistical profile either. We replicate key prior results including a Zipf slope of -1.49 and conditional entropy of 3.23 bits. All code and data are publicly available.
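The conditional entropy figure in the abstract can be read as the average uncertainty, in bits, about the next sign given the current one. The sketch below computes the plain plug-in estimate from adjacent sign pairs inside each inscription. It is an illustration under stated assumptions: the function name and input format are hypothetical, pairs never span inscription boundaries, and no smoothing or bias correction is applied, any of which may differ from the paper's estimator.

```python
import math
from collections import Counter


def conditional_entropy(inscriptions):
    """Plug-in estimate of H(next sign | current sign) in bits.

    H = -sum over pairs (a, b) of p(a, b) * log2 p(b | a),
    with probabilities taken from raw bigram counts.
    """
    pair_counts = Counter()
    first_counts = Counter()
    for text in inscriptions:
        # Adjacent pairs within one inscription only.
        for a, b in zip(text, text[1:]):
            pair_counts[(a, b)] += 1
            first_counts[a] += 1

    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("corpus contains no adjacent sign pairs")

    entropy = 0.0
    for (a, _b), n in pair_counts.items():
        p_joint = n / total
        p_cond = n / first_counts[a]
        entropy -= p_joint * math.log2(p_cond)
    return entropy
```

A fully deterministic sequence gives 0 bits, and a sign followed equally often by two alternatives gives 1 bit, so lower values indicate more rigid sequencing.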

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a multi-metric discrimination framework that compares the statistical profile of the Indus Valley sign system (1,916 deduplicated inscriptions, 584 signs, 11,110 tokens) against two families of synthetic non-linguistic baselines (heraldic emblem and administrative coding systems) calibrated on Zipfian frequencies, positional constraints, and bigram dependencies from six attested corpora, plus seven real non-linguistic corpora. It evaluates four properties (brevity, formulaic repetition, hapax rate, positional rigidity) and reports that the Indus corpus occupies an intermediate position, matching neither baseline family nor any attested non-linguistic system on all metrics simultaneously. The work also replicates prior quantitative results (Zipf slope of -1.49 and conditional entropy of 3.23 bits) and releases all code and data publicly.

Significance. If the results hold, this provides a replicable quantitative scorecard for assessing undeciphered scripts against calibrated non-linguistic proxies, adding empirical data to the linguistic-status debate. Strengths include the public code and data release, explicit replication of prior metrics, and anchoring of baselines in external attested corpora rather than ad-hoc invention.

major comments (2)
  1. [Synthetic baseline generation and calibration] The central claim that the Indus profile matches neither baseline family nor any attested non-linguistic corpus depends on the assumption that the two synthetic generator families (calibrated only on frequency, positional, and bigram statistics from six external corpora) adequately span the space of possible non-linguistic systems. The manuscript does not test or bound whether alternative structures (e.g., hierarchical codes or higher-order dependencies) could simultaneously match the observed values on all four metrics; this is load-bearing for interpreting the mismatch result.
  2. [Comparison against real-world non-linguistic corpora] The statement that 'no attested non-linguistic system reproduces the full Indus statistical profile' is based on seven specific corpora (including Sproat 2014 datasets). The selection criteria for these corpora and any sensitivity analysis to corpus choice are not detailed, leaving the generality of the 'no match' conclusion open to the risk that other unexamined systems could align.
minor comments (2)
  1. [Corpus description] The abstract and methods should explicitly cross-reference the exact sections or supplementary material that document the deduplication procedure, sign inventory construction, and any exclusion rules applied to the 1,916 inscriptions.
  2. [Metric definitions] Notation for the four metrics (brevity, formulaic repetition, hapax rate, positional rigidity) would benefit from a single consolidated table or equation block early in the paper to improve readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important considerations regarding the scope of our synthetic baselines and the selection of real-world corpora. We address each major comment below and indicate where revisions will be made to improve clarity and transparency.

Point-by-point responses
  1. Referee: [Synthetic baseline generation and calibration] The central claim that the Indus profile matches neither baseline family nor any attested non-linguistic corpus depends on the assumption that the two synthetic generator families (calibrated only on frequency, positional, and bigram statistics from six external corpora) adequately span the space of possible non-linguistic systems. The manuscript does not test or bound whether alternative structures (e.g., hierarchical codes or higher-order dependencies) could simultaneously match the observed values on all four metrics; this is load-bearing for interpreting the mismatch result.

    Authors: We agree that the two generator families do not exhaustively span all possible non-linguistic structures. Our heraldic and administrative baselines were chosen specifically because they instantiate two distinct archetypes commonly discussed in the literature on the Indus script, with parameters derived from six attested corpora to ground them empirically rather than invent them ad hoc. The central empirical result remains that neither family reproduces the full set of four Indus metrics simultaneously, and the same holds for the seven real corpora examined. We will add a new paragraph in the Discussion section acknowledging that more complex mechanisms (hierarchical codes, higher-order dependencies) are untested and could in principle produce closer matches; we will also note that fully bounding the space of all conceivable non-linguistic systems lies beyond the scope of any single study. This revision clarifies interpretive limits without changing the reported comparisons or conclusions. revision: partial

  2. Referee: [Comparison against real-world non-linguistic corpora] The statement that 'no attested non-linguistic system reproduces the full Indus statistical profile' is based on seven specific corpora (including Sproat 2014 datasets). The selection criteria for these corpora and any sensitivity analysis to corpus choice are not detailed, leaving the generality of the 'no match' conclusion open to the risk that other unexamined systems could align.

    Authors: The seven corpora were selected as the principal publicly available non-linguistic datasets that have been used in prior quantitative work on the Indus debate, including the Sproat 2014 collection and related sources. In the revised manuscript we will add an explicit subsection in Methods describing the selection criteria (public availability, relevance to the linguistic-status literature, and structural variety) and will include a sensitivity table showing that the 'no full-profile match' result is robust across different subsets of the seven corpora. These additions will be incorporated in the next version. revision: yes

standing simulated objections not resolved
  • We cannot exhaustively test or bound every conceivable non-linguistic structure, as the design space is open-ended.

Circularity Check

0 steps flagged

No circularity: empirical comparisons anchored externally

The paper constructs two synthetic generator families by fitting Zipfian frequencies, positional constraints, and bigram dependencies to statistics drawn from six external attested non-linguistic corpora, then evaluates the Indus corpus (1,916 inscriptions) against these generators and seven real-world non-linguistic corpora on four metrics. No derivation step reduces by construction to the paper's own inputs; the central claim that the Indus profile matches neither baseline family nor any attested corpus rests on independent external benchmarks rather than self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The framework is self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the four metrics drawn from the Farmer-Sproat-Witzel critique are diagnostic and that the chosen real-world corpora provide representative calibration targets. No new entities are postulated and no parameters appear to be fitted directly to the Indus data itself.

axioms (1)
  • Domain assumption: the four properties (text brevity, repeated formulaic phrases, hapax legomenon rate, positional rigidity) are the central properties for testing the Farmer-Sproat-Witzel critique of linguistic status.
    These four properties are selected as the evaluation criteria because they are stated to be central to the 2004 critique.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Odyssey: Constructing Verifiable Local Truth-Preserving Foundation Models

    cs.AI 2026-06 unverdicted novelty 3.0

    ODYSSEY is a sheaf-theoretic framework for building verifiable foundation models as compositions of foundries via left and right Kan extensions.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · cited by 1 Pith paper

  1. Cuneiform Digital Library Initiative (CDLI). https://cdli.ucla.edu/
  2. Farmer, S., Sproat, R., & Witzel, M. (2004). The collapse of the Indus-script thesis: The myth of a literate Harappan civilization. Electronic Journal of Vedic Studies, 11(2), 19–57.
  3. Kumar, A., Subramanian, S., et al. (2021). Neural sequence models for the Indus script. arXiv preprint.
  4. Mahadevan, I. (1977). The Indus Script: Texts, Concordance and Tables. Memoirs of the Archaeological Survey of India, No. 77.
  5. Parpola, A. (1994). Deciphering the Indus Script. Cambridge University Press.
  6. Rao, R. P. N., Yadav, N., Vahia, M. N., Joglekar, H., Adhikari, R., & Mahadevan, I. (2009). Entropic evidence for linguistic structure in the Indus script. Science, 324(5931), 1165.
  7. Sproat, R. (2010). Ancient symbols, computational linguistics, and the reviewing practices of the general science journals. Computational Linguistics, 36(3), 585–594.
  8. Sproat, R. (2014). A statistical comparison of written language and nonlinguistic symbol systems. Language Resources and Evaluation, 48(4), 681–693.
  9. Vidale, M. (2007). The collapse melts down: A reply to Farmer, Sproat, and Witzel. East and West, 57(1–4), 333–366.
  10. Wells, B. K. (2015). An Epigraphic Approach to Indus Writing. Oxbow Books.
  11. Yadav, N., Joglekar, H., Rao, R. P. N., Vahia, M. N., Adhikari, R., & Mahadevan, I. (2010). Statistical analysis of the Indus script using n-grams. PLoS ONE, 5(3), e9506.
  12. Yadav, N., Vahia, M. N., Mahadevan, I., & Joglekar, H. (2008). Segmentation of Indus texts. International Journal of Dravidian Linguistics, 37(1), 39–52.