Translators as Invisible Teachers of AI: Copyright, Translation Memory, and the Political Economy of Linguistic Data

Masaru Yamada

arxiv: 2605.24842 · v1 · pith:LV572KOTnew · submitted 2026-05-24 · 💻 cs.CL · cs.CY

Pith reviewed 2026-06-30 12:32 UTC · model grok-4.3

classification 💻 cs.CL cs.CY

keywords: translation memory, machine translation, AI training data, copyright law, invisible labor, data appropriation, language service providers, model collapse

The pith

Translators have served as the invisible teachers of AI by supplying the translation data that trained statistical, neural, and large language models, yet receive no attribution under current copyright and contract rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper argues that the labor of translators has been transformed into foundational data capital for AI through translation memories and parallel corpora. These resources provide the supervised training data essential for the development of statistical machine translation, neural machine translation, the Transformer architecture, and multilingual large language models. Translators' contributions are bought as contract deliverables and classified as information analysis under Japanese, European, and US copyright law, resulting in the loss of moral, creative, and economic attribution. The paper introduces the concepts of appropriation without consumption and the invisible teacherisation of translators to describe this process and explores the data supply chain and possibilities for redistributive design.

Core claim

The development of statistical machine translation, neural machine translation, the Transformer architecture, and multilingual large language models cannot be disentangled from the accumulation of translation data. Translators have functioned as invisible teachers of AI through the construction of translation memories, post-editing, and quality assessment without recognition as such. Their renditions are processed as information analysis data under copyright law, losing attribution, in a process the paper terms appropriation without consumption.

What carries the argument

Invisible teacherisation: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without recognition as such.

Load-bearing premise

The legal classification of translation data as information analysis under copyright frameworks, together with contractual purchase of deliverables, necessarily erases translators' moral, creative, and economic attribution.

What would settle it

An audit of major multilingual LLM training datasets that checks whether professional translation memories appear with or without attribution to the original translators who produced the target texts.
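A minimal sketch of what such an audit could look like for a single translation memory, assuming the data is stored as TMX and that the `creationid` attribute on a `<tu>` or `<tuv>` element is treated as a proxy for translator attribution. The sample data and the `audit_attribution` function are hypothetical illustrations, not part of the paper, and a real audit would also need to trace how the data entered a given training set.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-unit TMX sample: the first unit carries a creationid,
# the second carries no attribution metadata at all.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu creationid="translator_a">
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="fr"><seg>Merci.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def audit_attribution(tmx_text):
    """Count translation units with and without a creationid attribute.

    Assumption: a creationid on the <tu> or on any of its <tuv> children
    counts as attribution. Returns (attributed, unattributed).
    """
    root = ET.fromstring(tmx_text)
    attributed = 0
    unattributed = 0
    for tu in root.iter("tu"):
        has_id = bool(tu.get("creationid")) or any(
            tuv.get("creationid") for tuv in tu.findall("tuv")
        )
        if has_id:
            attributed += 1
        else:
            unattributed += 1
    return attributed, unattributed


if __name__ == "__main__":
    with_id, without_id = audit_attribution(SAMPLE_TMX)
    print(f"attributed units: {with_id}, unattributed units: {without_id}")
```

A count like this only shows whether attribution metadata survives inside a file; it says nothing about whether the person named was credited or paid downstream.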

Original abstract

This paper examines how the labour of translators has been transformed into foundational data capital for the age of artificial intelligence (AI). Translation memories (TM) and parallel corpora preserve a one-to-one correspondence between source and target text and therefore constitute extraordinarily valuable supervised training data for machine translation. The development of statistical machine translation (SMT), neural machine translation (NMT), the Transformer architecture, and multilingual large language models (LLMs) cannot be disentangled from the accumulation of such translation data. And yet, translators' renditions have been bought as deliverables under contract, segmented as technical objects, and processed as "information analysis" data under copyright law -- losing their moral, creative, and economic attribution to the translators who produced them. The paper develops two concepts to capture this process. The first is appropriation without consumption: a mode of use in which works are not read, viewed, or listened to, but only mined for statistical features -- a use that is legitimated under Article 30-4 of the Japanese Copyright Act. The second is the invisible teacherisation of translators: the process by which translators, through the construction of translation memories, post-editing, and quality assessment, have functioned as teachers of AI without recognition as such. Drawing on the data supply chain that runs from translators through language service providers (LSPs) and platforms to model developers, on a comparative reading of Japanese, European, and United States legal frameworks, on the distinction between open and proprietary AI models, and on the premium status that human-generated data has acquired in the era of model collapse, the paper asks what translators are actually afraid of, and points toward concrete directions for redistributive design.
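The one-to-one correspondence described in the abstract can be illustrated with a short sketch that flattens a translation memory into (source, target) pairs, the shape in which parallel text is typically fed to a machine translation system as supervised data. This assumes TMX input; the sample data and the `extract_pairs` function are hypothetical, and the sketch is not a description of any particular developer's pipeline. Note that the flattening step keeps only the segment text, so the `creationid` metadata in the sample does not survive it.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical TMX sample with one attributed and one unattributed unit.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="example" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu creationid="translator_a">
      <tuv xml:lang="en"><seg>Good morning.</seg></tuv>
      <tuv xml:lang="fr"><seg>Bonjour.</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en"><seg>Thank you.</seg></tuv>
      <tuv xml:lang="fr"><seg>Merci.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def extract_pairs(tmx_text, src_lang, tgt_lang):
    """Return a list of (source, target) text pairs from a TMX document.

    Only segment text is kept; attributes such as creationid are dropped,
    which is the point the sketch is meant to make visible.
    """
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        # Keep the unit only if both language variants are present.
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs


if __name__ == "__main__":
    for source, target in extract_pairs(SAMPLE_TMX, "en", "fr"):
        print(f"{source} -> {target}")
```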

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper coins two new terms to frame translators as uncredited data suppliers for AI, but its legal and supply-chain claims remain mostly interpretive.

The punchline is that this work gives us 'appropriation without consumption' and 'invisible teacherisation' as fresh ways to talk about how translation memories and post-editing feed SMT, NMT, and LLMs without the original labor getting recognized. Those labels are not in the standard MT or copyright literature the abstract cites, so they are genuinely new devices.

It handles the supply chain from translators through LSPs to model developers clearly and ties it to the current premium on human data amid model collapse. The comparative nod to Japanese Article 30-4, EU, and US rules is a reasonable starting point for the legal angle.

The soft spots are proportionate. The central move treats contractual purchase plus 'information analysis' classification as automatically erasing moral and economic attribution, yet the abstract (and the stress-test note) gives no specific cases, compensation figures, or falsifiable checks on how often that erasure actually occurs. The argument is conceptual and normative rather than carrying new empirical weight or formal derivations.

This is for readers already working on AI labor, data provenance, or copyright exceptions in training sets. Someone tracking policy angles on linguistic data could pull useful framing from it.

It deserves a serious referee because the topic is current and the concepts are original, even if the evidence base is descriptive.

Referee Report

0 major / 0 minor

Summary. The paper claims that translators' labor, embodied in translation memories and parallel corpora, has been foundational to the development of statistical machine translation, neural machine translation, the Transformer architecture, and multilingual LLMs, yet this labor is transformed into data capital through contractual purchase of deliverables and legal classification as 'information analysis' (e.g., under Japanese Copyright Act Article 30-4). It introduces the concepts of 'appropriation without consumption' (mining statistical features without reading or viewing the works) and 'invisible teacherisation of translators' (via TM construction, post-editing, and quality assessment) to describe the erasure of moral, creative, and economic attribution. Drawing on the data supply chain from translators through LSPs to model developers, comparative analysis of Japanese, EU, and US copyright frameworks, open vs. proprietary models, and the premium on human data amid model collapse, the paper examines translators' concerns and outlines directions for redistributive design.

Significance. If the interpretive claims hold, the paper offers a valuable socio-legal analysis of how linguistic labor underpins AI systems, introducing two original concepts that frame data appropriation in translation. It integrates historical MT developments with current supply-chain and legal observations, providing a grounded basis for normative arguments on attribution and redistribution. This could inform policy discussions on data rights in NLP and AI, particularly regarding the role of human-generated data. The comparative legal reading and supply-chain description are strengths that lend concreteness to an otherwise conceptual argument.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript, including the recognition of its socio-legal analysis, the two original concepts introduced, and the integration of historical MT developments with supply-chain and legal observations. The recommendation to accept is appreciated.

Circularity Check

0 steps flagged

No significant circularity identified

The paper advances a conceptual socio-legal argument about the transformation of translator labor into AI training data, supported by external references to Japanese Copyright Act Article 30-4, EU/US legal frameworks, industry supply-chain descriptions, and distinctions between open/proprietary models. No derivation chain, equations, fitted parameters, or self-referential definitions are present that would reduce the central claims to inputs by construction. The analysis relies on independent legal texts and historical observations rather than any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on two invented conceptual entities and several domain assumptions about how copyright law treats segmented translation data; no numerical free parameters are present.

axioms (2)

domain assumption: Translation memories preserve a one-to-one correspondence between source and target text and therefore constitute valuable supervised training data.
Stated directly in the abstract as the basis for the economic value of translator output.
domain assumption: Translators' renditions are bought as deliverables under contract and processed as information analysis data under copyright law.
Foundational premise for the claim that attribution is lost.

invented entities (2)

appropriation without consumption (no independent evidence)
purpose: A mode of use in which works are mined for statistical features rather than read, viewed, or listened to.
New concept introduced to describe legitimation under Article 30-4 of the Japanese Copyright Act.
invisible teacherisation of translators (no independent evidence)
purpose: The process by which translators function as teachers of AI through translation memory construction, post-editing, and quality assessment without recognition.
New concept introduced to capture unrecognized labor in the AI data supply chain.

pith-pipeline@v0.9.1-grok · 5837 in / 1674 out tokens · 65134 ms · 2026-06-30T12:32:24.971614+00:00 · methodology

Reference graph

Works this paper leans on

2 extracted references · 1 canonical work pages · 1 internal anchor

[1]

General Approach to AI and Copyright [in Japanese]

Agency for Cultural Affairs (2024). General Approach to AI and Copyright [in Japanese]. Subcommittee on Legal Issues, Copyright Subdivision, Council for Cultural Affairs. Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural Machine Translation by Jointly Learning to Align and Translate. Proceedings of the 3rd International Conference on Learning Represent...

2024
[2]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

European Parliament and Council (2019). Directive (EU) 2019/790 on Copyright and Related Rights in the Digital Single Market (DSM Directive), Articles 3–4. European Parliament and Council (2024). Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (AI Act). International Federation of Translators (FIT) (2023). FIT Position ...

2019