arxiv: 2605.02052 · v1 · submitted 2026-05-03 · 💻 cs.CL

Methods, Data, and Conceptual Change: Reflections from Two Quantitative Diachronic Case Studies

Pith reviewed 2026-05-08 19:21 UTC · model grok-4.3

classification 💻 cs.CL
keywords quantitative historical linguistics · diachronic semantic change · corpus structure · methodological comparison · lexical frequency methods · Early Modern English · scientific discourse

The pith

Comparative reflection on two quantitative studies shows dataset structure limits what kinds of semantic change frequency-based methods can detect.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This discussion paper examines how quantitative methods for tracking language change over time depend on the properties of the historical datasets they analyze. It draws on one case that models concepts through quad-based analysis of a large Early Modern English corpus and another that applies SynFlow analysis to scientific texts from the Royal Society. By placing the two approaches side by side, the authors demonstrate that purely lexical and frequency-driven techniques have built-in restrictions and that the organization of the underlying data determines which diachronic shifts can be identified with confidence.

Core claim

Through parallel examination of quad-based concept modelling on EEBO-TCP data (c. 1470s-1690s) and SynFlow analysis on the Royal Society Corpus (1750-1799), the paper establishes that dataset structure shapes the kinds of semantic change quantitative methods can reliably detect and thereby clarifies the inherent limits of approaches that rely solely on lexical frequency.

What carries the argument

Comparative methodological reflection that contrasts how each of the two chosen techniques operationalizes concepts, the data assumptions each carries, and the diachronic interpretations each supports.

Load-bearing premise

The assumption that the operational choices and interpretive limits observed in these two specific corpora and methods will hold for quantitative diachronic work in general.

What would settle it

A quantitative study using different corpora and methods that successfully detects all major types of semantic change without any detectable dependence on dataset structure would challenge the central claim.

Original abstract

This discussion paper reflects on how quantitative approaches to historical linguistics interact with dataset properties. Drawing on two worked examples, we examine English data using quad-based concept modelling of Early Modern English discourse in EEBO-TCP (c. 1470s-1690s; 765M words) alongside SynFlow analysis of scientific writing in Royal Society Corpus 6.0.4 (1750-1799; drawn from a 78.6M-token open corpus). Through parallel comparison, the paper explores how each approach operationalises concepts, the data assumptions they entail, and the diachronic interpretations they support. We argue that comparative methodological reflection clarifies the limits of purely lexical, frequency-based approaches and highlights how dataset structure shapes the kinds of semantic change that quantitative methods can reliably detect.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. This discussion paper reflects on how quantitative approaches to historical linguistics interact with dataset properties. Drawing on two worked examples, it examines English data using quad-based concept modelling of Early Modern English discourse in EEBO-TCP (c. 1470s-1690s; 765M words) alongside SynFlow analysis of scientific writing in Royal Society Corpus 6.0.4 (1750-1799; drawn from a 78.6M-token open corpus). Through parallel comparison, the paper explores how each approach operationalises concepts, the data assumptions they entail, and the diachronic interpretations they support. The central argument is that comparative methodological reflection clarifies the limits of purely lexical, frequency-based approaches and highlights how dataset structure shapes the kinds of semantic change that quantitative methods can reliably detect.

Significance. If the interpretive claims hold, the paper makes a useful contribution to computational historical linguistics by supplying concrete, parallel case studies that illustrate often-overlooked interactions between method and corpus structure. The explicit use of two large, publicly referenced corpora (EEBO-TCP and Royal Society Corpus) and the side-by-side comparison of distinct operationalizations provide practical grounding that strengthens the reflective argument. Such discussion pieces help the field move beyond purely lexical frequency counts toward more data-aware quantitative work.

minor comments (2)
  1. [Abstract] The abstract introduces 'quad-based concept modelling' and 'SynFlow analysis' without a one-sentence gloss or pointer to the relevant literature; adding brief definitional phrases would improve accessibility for readers outside the immediate subfield.
  2. [Case-study comparison section] The manuscript would benefit from a short table or paragraph that explicitly contrasts the two methods' handling of frequency information versus other features (e.g., co-occurrence patterns or syntactic context), making the claimed 'limits of purely lexical, frequency-based approaches' easier to evaluate directly from the examples.
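
The contrast requested in the second comment can be illustrated with a toy sketch. It is not quad-based concept modelling and not SynFlow; it only shows, under stated assumptions, how a plain relative-frequency count can stay flat across two periods while the words co-occurring with a target term change. The corpus, the period labels, the target word, and all function names below are hypothetical and do not come from EEBO-TCP or the Royal Society Corpus.

```python
from collections import Counter

# Hypothetical toy corpus: period label -> token list (invented, not real corpus data).
TOY_CORPUS = {
    "period_1": "the air is a subtle spirit and the spirit moves the air".split(),
    "period_2": "the air is a mixed gas and the gas fills the air".split(),
}


def relative_frequency(tokens, target):
    """Occurrences of `target` per 1,000 tokens."""
    return 1000 * tokens.count(target) / len(tokens)


def window_collocates(tokens, target, window=3):
    """Count words found within `window` tokens of each occurrence of `target`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        # Skip the target occurrence itself; count its neighbours.
        counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts


def compare(target="air", window=3):
    """Return frequency and collocate counts for `target` in each toy period."""
    return {
        period: {
            "rel_freq": relative_frequency(tokens, target),
            "collocates": window_collocates(tokens, target, window),
        }
        for period, tokens in TOY_CORPUS.items()
    }


if __name__ == "__main__":
    for period, row in compare().items():
        print(period, round(row["rel_freq"], 1), row["collocates"].most_common(4))
```

In this invented example the target has the same relative frequency in both periods, so a frequency-only view registers no change, while the collocate counts differ between periods. Real corpora add further complications, such as uneven period sizes and genre balance, which the sketch deliberately leaves out.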

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our discussion paper and the recommendation for minor revision. The assessment correctly identifies the value of our side-by-side comparison of quad-based concept modelling on EEBO-TCP and SynFlow analysis on the Royal Society Corpus in clarifying interactions between method, data structure, and detectable semantic change.

Circularity Check

0 steps flagged

No circularity: reflective discussion without derivations or self-referential predictions

Full rationale

The paper is explicitly a methodological reflection that draws on two independent worked examples (quad-based modelling on EEBO-TCP and SynFlow on the Royal Society Corpus) to illustrate interactions between quantitative techniques and corpus structure. No equations, fitted parameters, predictions, or uniqueness theorems are claimed; the central argument is interpretive and rests on direct comparison of the case studies rather than any reduction to inputs by construction or self-citation chains. All load-bearing steps are external to the paper's own data processing and remain falsifiable through the cited corpora and methods.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is a methodological reflection and introduces no new free parameters, mathematical axioms, or invented entities; it relies on standard domain assumptions in computational historical linguistics about how concepts can be operationalized from text.

axioms (1)
  • domain assumption: Quantitative methods can operationalize abstract concepts from historical text collections in ways that support diachronic interpretation.
    Invoked when describing quad-based modelling and SynFlow analysis as tools for examining conceptual change.

Reference graph

Works this paper leans on

8 extracted references · 8 canonical work pages

  1. Fischer, S., Knappen, J., Menzel, K., & Teich, E. (2020). The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study. In N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Twelfth Lan...

  2. Fitzmaurice, S., Robinson, J. A., Alexander, M., Hine, I. C., Mehl, S., & Dallachy, F. (2017). Linguistic DNA: investigating conceptual change in Early Modern English discourse. Studia Neophilologica, 89(sup1), 21-38.

  3. Fitzmaurice, S., & Mehl, S. (2022). Volatile concepts: Analysing discursive change through underspecification in co-occurrence quads. International Journal of Corpus Linguistics, 27(4), 428-450.

  4. Kermes, H., Degaetano-Ortlieb, S., Khamis, A., Knappen, J., & Teich, E. (2016). The Royal Society Corpus: From Uncharted Data to Corpus. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16) (pp. 1928–1931). European Language Resources Association (ELRA). https://aclanthology.org/L16-1305/

  5. Knappen, J., Fischer, S., Kermes, H., Teich, E., & Fankhauser, P. (2017). The Making of the Royal Society Corpus. In G. Bouma & Y. Adesam (Eds.), Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language (pp. 7–11). Linköping University Electronic Press. https://aclanthology.org/W17-0503/

  6. Menzel, K., Knappen, J., & Teich, E. (2021). Generating linguistically relevant metadata for the Royal Society Corpus. Research in Corpus Linguistics, 9(1), 1–18. https://doi.org/10.32714/ricl.09.01.02

  7. Phan-Tất, B. (2025). SynFlow [Computer software]. Zenodo. https://doi.org/10.5281/zenodo.17414457

  8. Qi, P., Zhang, Y., Zhang, Y., Bolton, J., & Manning, C. D. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages (arXiv:2003.07082). arXiv. https://doi.org/10.48550/arXiv.2003.07082

  9. Text Creation Partnership (TCP). (2020). Early English Books Online Text Creation Partnership (EEBO-TCP): Phase I & II Transcriptions. https://...