Targum -- A Multilingual New Testament Translation Corpus
Pith reviewed 2026-05-16 03:11 UTC · model grok-4.3
The pith
A new corpus of 651 New Testament translations across five languages supplies 334 unique versions with metadata that supports both fine-grained and broad historical analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By collecting 651 New Testament translations from twelve online libraries and one prior corpus, then canonicalizing each with standardized identifiers for work, edition, and revision year, the resulting set of 334 unique texts enables flexible micro-level analysis of translation families alongside macro-level studies after deduplication.
What carries the argument
Canonicalization metadata that assigns each translation a standardized identifier for work, edition, and revision year, enabling custom definitions of uniqueness at different scales of analysis.
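A minimal sketch of how such identifiers could support user-defined uniqueness. The field names, the edition labels, and the sample records are illustrative assumptions, not the corpus's actual schema, which this page does not describe.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Translation:
    # All field names are hypothetical; the real metadata schema may differ.
    source: str    # library the text was collected from
    language: str
    work: str      # standardized work identifier
    edition: str   # edition identifier
    year: int      # year of revision


def deduplicate(records, key_fields):
    """Keep the first record seen for each combination of key_fields."""
    seen = {}
    for rec in records:
        key = tuple(getattr(rec, field) for field in key_fields)
        seen.setdefault(key, rec)
    return list(seen.values())


# Illustrative records only; not taken from the corpus.
records = [
    Translation("library_a", "en", "KJV", "standard", 1769),
    Translation("library_b", "en", "KJV", "standard", 1769),  # same text, second source
    Translation("library_a", "en", "KJV", "original", 1611),
    Translation("library_c", "en", "ASV", "standard", 1901),
]

# Micro level: every distinct (work, edition, year) counts as its own text.
micro = deduplicate(records, ("work", "edition", "year"))

# Macro level: collapse each translation family to a single representative.
macro = deduplicate(records, ("work",))

print(len(micro), len(macro))
```

Changing the tuple of key fields is the only thing that separates the two analyses, which is the sense in which uniqueness is left to the researcher.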
If this is right
- Researchers can conduct micro-level analyses on specific translation lineages such as the KJV family.
- Macro-level studies can proceed by deduplicating closely related texts to reveal broader patterns.
- Quantitative work on translation history gains access to greater depth per language than prior corpora provided.
- Users can tailor the definition of uniqueness to match the needs of their particular research question.
Where Pith is reading between the lines
- The resource could support computational models trained on historical religious texts to track changes in language use over centuries.
- Cross-language comparisons might surface patterns in how biblical content was adapted to different cultural contexts.
- Extensions that add the Old Testament or further languages would allow wider tests of translation evolution across time and regions.
Load-bearing premise
Translations gathered from the twelve online libraries can be reliably grouped and deduplicated using metadata on work, edition, and year without major errors or omissions of significant variants.
What would settle it
Finding a substantial number of translations that cannot be accurately mapped to unique work-edition-year identifiers or discovering large coverage gaps among the source libraries.
Original abstract
Many European languages possess rich biblical translation histories, yet existing corpora - in prioritizing linguistic breadth - often fail to capture this depth. To address this gap, we introduce a multilingual corpus of 651 New Testament translations, of which 334 are unique, spanning five languages with 2.4-5.0x more translations per language than any prior corpus: English (194 unique versions from 390 total), French (41 from 78), Italian (17 from 33), Polish (29 from 48), and Spanish (53 from 102). Aggregated from 12 online biblical libraries and one preexisting corpus, each translation is annotated with metadata that maps the text to a standardized identifier for the work, its specific edition, and its year of revision. This canonicalization allows researchers to define "uniqueness" for their own needs: they can perform micro-level analyses on translation families, such as the KJV lineage, or conduct macro-level studies by deduplicating closely related texts. By providing the first multilingual resource with sufficient depth per language for flexible, multilevel analysis, the corpus fills a gap in the quantitative study of translation history.
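The per-language figures quoted in the abstract can be checked against the headline totals. A small sketch; the dictionary layout is only for illustration, and the numbers are those given above.

```python
# (unique, total) translation counts per language, as quoted in the abstract.
counts = {
    "English": (194, 390),
    "French": (41, 78),
    "Italian": (17, 33),
    "Polish": (29, 48),
    "Spanish": (53, 102),
}

unique_total = sum(unique for unique, _ in counts.values())
all_total = sum(total for _, total in counts.values())

print(unique_total, all_total)  # 334 651

# Average number of collected copies per unique version, by language.
for language, (unique, total) in counts.items():
    print(f"{language}: {total / unique:.2f} copies per unique version")
```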
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Targum, a multilingual corpus of 651 New Testament translations (334 unique after canonicalization) spanning English (194 unique from 390 total), French (41 from 78), Italian (17 from 33), Polish (29 from 48), and Spanish (53 from 102). The resource aggregates texts from 12 online biblical libraries plus one preexisting corpus and annotates each with metadata mapping to standardized identifiers for work, edition, and year of revision. This canonicalization is intended to support flexible, user-defined notions of uniqueness for micro-level (e.g., KJV lineage families) or macro-level (deduplicated) quantitative studies of translation history.
Significance. If the metadata mapping is reliable, the corpus supplies the first multilingual New Testament resource with sufficient per-language depth (2.4–5.0× prior corpora) to enable multilevel quantitative analyses of translation history that breadth-focused collections have not supported. The explicit support for researcher-defined uniqueness is a practical strength for downstream work in computational philology and historical linguistics.
major comments (1)
- [§3] §3 (Corpus Construction and Canonicalization): The aggregation and metadata canonicalization process from 12 heterogeneous online libraries is described at a high level but provides no validation protocol, manual audit sample size, inter-annotator agreement, or error-rate estimate for edition/year assignment. Because the central claim of reliable uniqueness definitions (and thus the 334-unique count) rests entirely on this step, the absence of any quantitative check on transcription or identification errors is load-bearing.
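One way to produce the kind of error-rate estimate requested here is to audit a random sample of records by hand and report a confidence interval on the observed error proportion. The sketch below uses the Wilson score interval; the sample size and error count are hypothetical, since the manuscript reports no audit.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must be between 0 and n")
    p = errors / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical audit: 3 wrong edition/year assignments in 100 sampled records.
low, high = wilson_interval(errors=3, n=100)
print(f"estimated error rate: 3.0% (95% interval {low:.1%} to {high:.1%})")
```

The Wilson interval is chosen over the simpler normal approximation because it behaves sensibly when the observed error count is small or zero, which is the likely case for a carefully curated metadata table.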
minor comments (2)
- [Table 1] Table 1 (language statistics): the column headers and footnotes should explicitly distinguish total vs. unique counts and state the deduplication rule applied for the reported 334 figure.
- [Abstract and §1] The abstract and §1 could add one sentence on total token count or average length per translation to give readers a sense of corpus scale beyond version counts.
Simulated Author's Rebuttal
We thank the referee for their careful review and constructive feedback on our manuscript. We address the major comment below.
- Referee: [§3] §3 (Corpus Construction and Canonicalization): The aggregation and metadata canonicalization process from 12 heterogeneous online libraries is described at a high level but provides no validation protocol, manual audit sample size, inter-annotator agreement, or error-rate estimate for edition/year assignment. Because the central claim of reliable uniqueness definitions (and thus the 334-unique count) rests entirely on this step, the absence of any quantitative check on transcription or identification errors is load-bearing.
  Authors: We agree that the current manuscript provides only a high-level description of the aggregation and canonicalization process and lacks details on validation. In the revised version, we will expand §3 to include a detailed protocol for how metadata was extracted and mapped from the 12 sources, including examples of conflict resolution for edition and year assignments. We will also describe the manual review process used for ambiguous cases. However, since no formal validation study with sample sizes, inter-annotator agreement, or error rates was performed, we cannot supply those quantitative measures. We will instead acknowledge this as a limitation and discuss potential sources of error in the metadata mapping.
  revision: partial
Circularity Check
No circularity: purely descriptive resource paper with no derivations or self-referential reductions
The paper introduces a new corpus by aggregating 651 translations from 12 libraries plus one preexisting corpus, then annotates each with work/edition/year metadata to support flexible uniqueness definitions. No equations, fitted parameters, predictions, or derivations exist anywhere in the described work. The central claim (first multilingual resource with sufficient per-language depth) is supported directly by the reported counts (e.g., 194 unique English versions) without any reduction to prior inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked. The work is self-contained descriptive resource creation, so the derivation chain is empty and the circularity score is 0.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Translations from online biblical libraries can be reliably aggregated and annotated with accurate metadata for work, edition, and revision year.