Translators as Invisible Teachers of AI: Copyright, Translation Memory, and the Political Economy of Linguistic Data
Pith reviewed 2026-06-30 12:32 UTC · model grok-4.3
The pith
Translators have served as the invisible teachers of AI by supplying the translation data that trained statistical, neural, and large language models, yet receive no attribution under current copyright and contract rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The development of statistical machine translation, neural machine translation, the Transformer architecture, and multilingual large language models cannot be disentangled from the accumulation of translation data. Translators have functioned as invisible teachers of AI through the construction of translation memories, post-editing, and quality assessment without recognition as such. Their renditions are processed as information analysis data under copyright law, losing attribution, in a process the paper terms appropriation without consumption.
What carries the argument
Invisible teacherisation: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without recognition as such.
Load-bearing premise
The legal classification of translation data as information analysis under copyright frameworks, together with contractual purchase of deliverables, necessarily erases translators' moral, creative, and economic attribution.
What would settle it
An audit of major multilingual LLM training datasets that checks whether professional translation memories appear with or without attribution to the original translators who produced the target texts.
Original abstract
This paper examines how the labour of translators has been transformed into foundational data capital for the age of artificial intelligence (AI). Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation. The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of such translation data. And yet, translators' renditions have been bought as deliverables under contract, segmented as technical objects, and processed as "information analysis" data under copyright law -- losing their moral, creative, and economic attribution to the translators who produced them. The paper develops two concepts to capture this process. The first is appropriation without consumption: a mode of use in which works are not read, viewed, or listened to, but only mined for statistical features -- a use that is legitimated under Article 30-4 of the Japanese Copyright Act. The second is the invisible teacherisation of translators: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without recognition as such. Drawing on the data supply chain that runs from translators through language service providers (LSPs) and platforms to model developers, on a comparative reading of Japanese, European, and United States legal frameworks, on the distinction between open and proprietary AI models, and on the premium status that human-generated data has acquired in the era of model collapse, the paper asks what translators are actually afraid of, and points toward concrete directions for redistributive design.
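The abstract's claim that translation memories work as supervised training data, and that attribution falls away once they are segmented, can be made concrete with a minimal sketch. It assumes the memory is stored in TMX, a common exchange format in which each translation unit (`tu`) holds language variants (`tuv`) and may carry a `creationid` attribute. The sample data, the translator identifiers, and both function names are hypothetical and do not come from the paper.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical two-unit translation memory (illustrative data only).
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example-tool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu creationid="translator-A">
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="ja"><seg>おはようございます。</seg></tuv>
    </tu>
    <tu creationid="translator-B">
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="ja"><seg>ありがとうございます。</seg></tuv>
    </tu>
  </body>
</tmx>"""


def _segments_by_language(tu):
    """Map each language code in a translation unit to its segment text."""
    segments = {}
    for tuv in tu.findall("tuv"):
        seg = tuv.find("seg")
        if seg is not None:
            segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
    return segments


def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return bare (source, target) pairs; translator metadata is discarded."""
    pairs = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = _segments_by_language(tu)
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


def extract_pairs_with_attribution(tmx_text, src_lang, tgt_lang):
    """Same extraction, but keeps the creationid recorded on each unit."""
    records = []
    for tu in ET.fromstring(tmx_text).iter("tu"):
        segments = _segments_by_language(tu)
        if src_lang in segments and tgt_lang in segments:
            records.append({
                "source": segments[src_lang],
                "target": segments[tgt_lang],
                "translator": tu.get("creationid"),  # None if not recorded
            })
    return records


if __name__ == "__main__":
    print(extract_pairs(SAMPLE_TMX, "en", "ja"))
    print(extract_pairs_with_attribution(SAMPLE_TMX, "en", "ja"))
```

The two functions differ only in whether the `creationid` field survives extraction. That makes attribution a choice made when the data is prepared, at least under the assumptions of this sketch.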
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that translators' labor, embodied in translation memories and parallel corpora, has been foundational to the development of statistical machine translation, neural machine translation, the Transformer architecture, and multilingual LLMs, yet this labor is transformed into data capital through contractual purchase of deliverables and legal classification as 'information analysis' (e.g., under Japanese Copyright Act Article 30-4). It introduces the concepts of 'appropriation without consumption' (mining statistical features without reading or viewing the works) and 'invisible teacherisation of translators' (via TM construction, post-editing, and quality assessment) to describe the erasure of moral, creative, and economic attribution. Drawing on the data supply chain from translators through LSPs to model developers, comparative analysis of Japanese, EU, and US copyright frameworks, open vs. proprietary models, and the premium on human data amid model collapse, the paper examines translators' concerns and outlines directions for redistributive design.
Significance. If the interpretive claims hold, the paper offers a valuable socio-legal analysis of how linguistic labor underpins AI systems, introducing two original concepts that frame data appropriation in translation. It integrates historical MT developments with current supply-chain and legal observations, providing a grounded basis for normative arguments on attribution and redistribution. This could inform policy discussions on data rights in NLP and AI, particularly regarding the role of human-generated data. The comparative legal reading and supply-chain description are strengths that lend concreteness to an otherwise conceptual argument.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript, including the recognition of its socio-legal analysis, the two original concepts introduced, and the integration of historical MT developments with supply-chain and legal observations. The recommendation to accept is appreciated.
Circularity Check
No significant circularity identified
Full rationale
The paper advances a conceptual socio-legal argument about the transformation of translator labor into AI training data, supported by external references to Japanese Copyright Act Article 30-4, EU/US legal frameworks, industry supply-chain descriptions, and distinctions between open/proprietary models. No derivation chain, equations, fitted parameters, or self-referential definitions are present that would reduce the central claims to inputs by construction. The analysis relies on independent legal texts and historical observations rather than any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Translation memories preserve a one-to-one correspondence between source and target text and therefore constitute valuable supervised training data.
- Domain assumption: Translators' renditions are bought as deliverables under contract and processed as information analysis data under copyright law.
invented entities (2)
- appropriation without consumption: no independent evidence
- invisible teacherisation of translators: no independent evidence