pith. sign in

arxiv: 2605.25358 · v1 · pith:XJSA6LVJnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI· cs.CY

AI-Associated Lexical Shifts Across 34 Languages: Cross-Lingual Convergence and Diachronic Uptake in News Writing

Pith reviewed 2026-06-29 22:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CY
keywords AI lexical shiftscross-lingual convergencenews writingdiachronic uptakeemphasize verbsWMT News CrawlGPT continuationssemantic convergence
0
0 comments X

The pith

AI lexical preferences extend across 34 languages with emphasize-type verbs recurring in 24 and rising in prevalence after ChatGPT.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether lexical choices linked to AI text generation, first observed in scientific English, also occur in news writing across 34 languages drawn from the WMT News Crawl corpus. It applies a diagnostic that generates GPT-4.1 continuations and contrasts their lemma frequencies against matched human text to rank AI-overused items by log prevalence ratios. The results show recurring semantic clusters, especially emphasize-type verbs, across typologically varied languages. Tracking the top items reveals increased use in 26 languages between 2020-2021 and 2023-2024, unlike matched baseline words. The pattern suggests AI text may be contributing to cross-lingual convergence in everyday writing.

Core claim

We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English.

What carries the argument

The split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text to derive ranked AI-overused lemmas using log prevalence ratios.

If this is right

  • Semantically related concepts recur across typologically diverse languages.
  • Prevalence of the top 20 AI-overused items increases in 26 of 34 languages after ChatGPT release.
  • Matched baseline words show no comparable increase.
  • Post-2022 increases in 10 languages exceed earlier modest shifts.
  • The findings are consistent with AI exerting cross-lingual homogenising pressure on global language use.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the pattern holds, news writing in multiple languages may gradually align on a shared set of AI-favored terms beyond the observed emphasize cluster.
  • The smaller effect sizes relative to scientific English suggest the uptake rate could vary by genre and audience.
  • Extending the same diagnostic to social media or fiction corpora could reveal whether convergence appears outside news.
  • Longer time series in additional languages would clarify whether the post-2022 acceleration continues or plateaus.

Load-bearing premise

The split-halves continuation diagnostic accurately isolates AI-associated lexical preferences without being driven by topic, style, or other non-AI differences between the two text sources.

What would settle it

A controlled test in which human and GPT-4.1 text are produced on identical narrow topics and styles yet still show the same ranked lemma differences, or independent news corpora showing no post-2022 rise in the identified emphasize-type lemmas.

Figures

Figures reproduced from arXiv: 2605.25358 by Thomas Stephan Juzek.

Figure 1
Figure 1. Figure 1: Embedding-based analysis of cross-lingual [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Prevalence change of top-20 AI-associated [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Longitudinal prevalence of three AI-associated semantic concepts across languages (2012–2024). Each [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
read the original abstract

AI-associated lexical shifts have been documented mainly in Scientific English. We extend this work to 34 languages in the WMT News Crawl corpus, refining a split-halves continuation diagnostic that compares GPT-4.1 continuations with matched human gold-standard text. For each language, we derive ranked AI-overused lemmas using log prevalence ratios. We find substantial cross-lingual semantic convergence: semantically related concepts recur across typologically diverse languages, with 'emphasize'-type verbs appearing in 24 of 34 languages. Embedding-based and manual analyses support this pattern. We also examine diachronic uptake in news writing before and after ChatGPT's release. Tracking each language's top 20 AI-overused items, we find prevalence increases in 26 of 34 languages from 2020-2021 to 2023-2024, with a mean change of +15.1%, whilst matched baseline words show no comparable increase (-4.5%). In 10 languages with longer historical coverage, longitudinal analyses show post-2022 increases that exceed the modest shifts observed in earlier periods, though with smaller effect sizes than in Scientific English. We validate our approach extensively, including across seeds, model variants, data sizes, model families, and more. Our findings are consistent with the view that AI-associated lexical preferences extend beyond English and may exert cross-lingual homogenising pressure on global language use.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript extends prior work on AI-associated lexical shifts from scientific English to 34 languages in the WMT News Crawl corpus. It introduces a split-halves continuation diagnostic that ranks lemmas by log prevalence ratio between GPT-4.1 continuations and matched human gold-standard text, then reports (i) cross-lingual semantic convergence (e.g., 'emphasize'-type verbs in 24 languages) supported by embeddings and manual checks, and (ii) diachronic uptake, with top-20 AI-overused lemmas rising +15.1% on average from 2020-2021 to 2023-2024 while matched baselines fall -4.5%. Extensive validation across seeds, model families, and data sizes is claimed.

Significance. If the diagnostic isolates AI-specific lexical preferences rather than generation artifacts, the results would indicate that AI lexical biases operate cross-lingually and exert measurable pressure on news writing after ChatGPT's release, extending beyond English and scientific domains. The parameter-free nature of the prevalence-ratio construction and the reported reproducibility across model variants are strengths.

major comments (1)
  1. [Methods / abstract] The split-halves continuation diagnostic (abstract and methods): the claim that prefix-matched GPT-4.1 continuations versus human gold-standard text isolates AI word-choice biases rests on the untested assumption that systematic differences in output properties (sentence length, formality, discourse structure, or residual topic mismatch) do not drive the observed log prevalence ratios. Because both the cross-lingual convergence count (24/34 languages) and the diachronic +15.1% vs. -4.5% contrast are derived directly from these ratios, any residual confound would undermine the central interpretation that the patterns reflect AI influence rather than generation artifacts.
minor comments (2)
  1. [Abstract] The abstract reports a mean prevalence change of +15.1% without accompanying standard deviation, confidence interval, or per-language distribution; adding these would strengthen the diachronic claim.
  2. [Abstract] The phrase 'matched baseline words' is used without an explicit definition of the matching procedure (e.g., frequency, length, or semantic similarity); a brief clarification would improve reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying a key assumption in the split-halves diagnostic. We address the concern directly below and outline targeted revisions.

read point-by-point responses
  1. Referee: [Methods / abstract] The split-halves continuation diagnostic (abstract and methods): the claim that prefix-matched GPT-4.1 continuations versus human gold-standard text isolates AI word-choice biases rests on the untested assumption that systematic differences in output properties (sentence length, formality, discourse structure, or residual topic mismatch) do not drive the observed log prevalence ratios. Because both the cross-lingual convergence count (24/34 languages) and the diachronic +15.1% vs. -4.5% contrast are derived directly from these ratios, any residual confound would undermine the central interpretation that the patterns reflect AI influence rather than generation artifacts.

    Authors: We agree that the diagnostic rests on an assumption that merits explicit testing. Prefix matching from the same human news texts substantially reduces topic mismatch, and the use of log prevalence ratios within continuations further focuses on word choice rather than global properties. Our reported validations across seeds, model families, and data sizes show stable lemma rankings, which would be unlikely if the signal were driven primarily by uniform generation artifacts. Nevertheless, we did not directly measure or control for differences in sentence length, formality, or discourse structure. In revision we will add (i) length-normalized prevalence ratios, (ii) formality scoring of continuations versus human text, and (iii) a comparison of discourse markers to quantify any residual stylistic differences. These controls will be reported alongside the existing cross-lingual and diachronic results. If the patterns persist after these adjustments, the interpretation that the shifts reflect AI lexical preferences will be strengthened; if not, we will qualify the claims accordingly. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central quantities are log prevalence ratios computed directly from GPT-4.1 continuations versus matched human gold-standard text drawn from the external WMT News Crawl corpus. Convergence counts (e.g., 'emphasize'-type verbs in 24 languages) and diachronic prevalence changes (+15.1% vs. -4.5% baseline) are simple empirical tallies and comparisons of these ratios across languages and time periods. No equations, fitted parameters, or self-citations reduce these observations to the input data by construction; the split-halves diagnostic is presented as an empirical method whose outputs are then measured against independent benchmarks. The derivation chain is therefore self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on the representativeness of the WMT News Crawl corpus for each language's news writing and on the assumption that the continuation diagnostic isolates AI lexical effects. No free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption The WMT News Crawl corpus provides representative samples of news writing for each of the 34 languages.
    The corpus is the sole data source for both the continuation comparisons and the diachronic tracking.
  • domain assumption The split-halves continuation diagnostic using GPT-4.1 produces text whose lexical differences from human gold-standard text are attributable to AI generation rather than other factors.
    This diagnostic is the basis for deriving the ranked AI-overused lemmas and all subsequent convergence and uptake claims.

pith-pipeline@v0.9.1-grok · 5792 in / 1548 out tokens · 42724 ms · 2026-06-29T22:53:58.365955+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 15 canonical work pages · 5 internal anchors

  1. [1]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Ai suggestions homogenize writing toward western styles and diminish cultural nuances. In Pro- ceedings of the 2025 CHI conference on human fac- tors in computing systems, pages 1–21. Barrett R Anderson, Jash Hemant Shah, and Max Kreminski. 2024. Homogenization effects of large 11 language models on human creative ideation. In Pro- ceedings of the 16th co...

  2. [2]

    arXiv preprint arXiv:2506.05339 , year =

    Examining linguistic shifts in academic writ- ing before and after the launch of chatgpt: a study on preprint papers. Scientometrics, 130(7):3597–3627. Michele Bevilacqua and Roberto Navigli. 2020. Break- ing through the 80% glass ceiling: Raising the state of the art in word sense disambiguation by incorpo- rating knowledge graph information. In Proceedi...

  3. [3]

    arXiv preprint arXiv:2304.04736

    On the possibilities of ai-generated text de- tection. arXiv preprint arXiv:2304.04736. Jay Chooi. 2026. Stylistic transfer from annotator com- munities to large language models. In Proceedings of the 10th Joint SIGHUM Workshop on Computa- tional Linguistics for Cultural Heritage, Social Sci- ences, Humanities and Literature 2026 , pages 135– 145. Paul F ...

  4. [4]

    In Pro- ceedings of the 24th conference on computational natural language learning, pages 609–619

    Cloze distillation: Improving neural language models with human next-word prediction. In Pro- ceedings of the 24th conference on computational natural language learning, pages 609–619. Sarah Fitterer, Dominik Gangl, and Jannes Ulbrich

  5. [5]

    In Proceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 4: Student Research Workshop) , pages 1239–1245

    Testing english news articles for lexical ho- mogenization due to widespread use of large lan- guage models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Lin- guistics (Volume 4: Student Research Workshop) , pages 1239–1245. 12 Iason Gabriel. 2020. Artificial intelligence, values, and alignment: I. gabriel. Minds and mach...

  6. [6]

    contamination

    Exploring the structure of ai-induced lan- guage change in scientific english. arXiv preprint arXiv:2506.21817. Sebastian Gehrmann, Hendrik Strobelt, and Alexan- der M Rush. 2019. Gltr: Statistical detection and visualization of generated text. In Proceedings of the 57th annual meeting of the association for com- putational linguistics: system demonstrati...

  7. [7]

    arXiv preprint arXiv:2603.25638

    Beyond via: Analysis and estimation of the impact of large language models in academic papers. arXiv preprint arXiv:2603.25638. Mingmeng Geng and Thierry Poibeau. 2025. On the de- tectability of llm-generated text: What exactly is llm- generated text? arXiv preprint arXiv:2510.20810. Mingmeng Geng and Roberto Trotta. 2024. Is chat- gpt transforming academ...

  8. [8]

    In Proceedings of the 54th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers) , pages 1489–1501

    Diachronic word embeddings reveal statisti- cal laws of semantic change. In Proceedings of the 54th Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers) , pages 1489–1501. Hans W A Hanley and Zakir Durumeric. 2024. Machine-made media: Monitoring the mobilization of machine-generated articles on misinformation and mains...

  9. [9]

    Scientific re- ports, 13(1):5487

    Artificial intelligence in communication im- pacts language and social relationships. Scientific re- ports, 13(1):5487. Yifei Huang, Jiuxin Cao, Hanyu Luo, Xin Guan, and Bo Liu. 2025. Magret: Machine-generated text detection with rewritten texts. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8336–8346. Yueyue Huan...

  10. [10]

    Proceedings of the National Academy of Sciences, 120(11):e2208839120

    Human heuristics for ai-generated language are flawed. Proceedings of the National Academy of Sciences, 120(11):e2208839120. Houji Jin, Negin Ashrafi, Armin Abdollahi, Wei Liu, Jian Wang, Ganyu Gui, Maryam Pishgar, and Huang- hao Feng. 2025a. Llm encoder vs. decoder: Ro- bust detection of chinese ai-generated text with lora. arXiv preprint arXiv:2509.0073...

  11. [11]

    Lengua y Sociedad , 23(2):895–910

    Análisis léxico de textos generados por mod- elos de lenguaje: reflejo de sus modelos de mundo. Lengua y Sociedad , 23(2):895–910. Kayvan Kousha and Mike Thelwall. 2025. How much are llms changing the language of academic papers after chatgpt? a multi-database and full text analysis. arXiv preprint arXiv:2509.09596. Raphail Krichevsky and Victor Trofimov....

  12. [12]

    Pan, 8(27-31):4

    Detecting fake content with relative entropy scoring. Pan, 8(27-31):4. Christoph Leiter, Jonas Belouadi, Y anran Chen, Ran Zhang, Daniil Larionov, Aida Kostikova, and Steffen Eger. 2024. Nllg quarterly arxiv report 09/24: What are the most influential current ai papers? arXiv preprint arXiv:2412.12121. Chih- Yuan Li, Soon Ae Chun, and James Geller. 2024a....

  13. [13]

    Computers in Human Behavior: Artificial Humans , page 100207

    Homogenizing effect of large language mod- els (llms) on creative diversity: An empirical com- parison of human and chatgpt writing. Computers in Human Behavior: Artificial Humans , page 100207. Sonia Krishna Murthy, Tomer Ullman, and Jennifer Hu

  14. [14]

    One fish, two fish, but not the whole sea: Alignment reduces language models’ conceptual di- versity. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Associ- ation for Computational Linguistics: Human Lan- guage Technologies (Volume 1: Long Papers), pages 11241–11258. Roberto Navigli and Simone Paolo Ponzetto. 2012. Ba-...

  15. [15]

    Advances in neural information processing systems, 36:53728– 53741

    Direct preference optimization: Y our lan- guage model is secretly a reward model. Advances in neural information processing systems, 36:53728– 53741. Rolf Reber, Norbert Schwarz, and Piotr Winkielman

  16. [16]

    Can AI-Generated Text be Reliably Detected?

    Processing fluency and aesthetic pleasure: Is beauty in the perceiver’s processing experience? Personality and social psychology review, 8(4):364– 382. Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual us- ing knowledge distillation. In Proceedings of the 2020 conference on empirical methods in natural lan- guage p...

  17. [17]

    International Journal of Speech Technology, 27(4):935–956

    Classification of human-and ai-generated texts for different languages and domains. International Journal of Speech Technology, 27(4):935–956. Dominik Schlechtweg, Barbara McGillivray, Simon Hengchen, Haim Dubossarsky, and Nina Tahmasebi

  18. [18]

    Towards Understanding Sycophancy in Language Models

    Semeval-2020 task 1: Unsupervised lexical semantic change detection. In Proceedings of the fourteenth workshop on semantic evaluation , pages 1–23. V Schmalz and Anaïs Tack. 2025. Can gptzero’s ai vocabulary distinguish between llm-generated and student-written essays. Kochmar, E.; Alhafni, B.; Bexte, M.; Burstein, J, pages 937–952. Mrinank Sharma, Meg To...

  19. [19]

    The Curse of Recursion: Training on Generated Data Makes Models Forget

    The curse of recursion: Training on gen- erated data makes models forget. arXiv preprint arXiv:2305.17493. Ilia Shumailov, Zakhar Shumaylov, Yiren Zhao, Nico- las Papernot, Ross Anderson, and Y arin Gal. 2024. Ai models collapse when trained on recursively gen- erated data. Nature, 631(8022):755–759. Olivia Sidoti and Colleen McClain. 2025. 34% of u.s. ad...

  20. [20]

    Antonio Marcio Da Silva and Lucia Rottava

    Pew Research Center. Antonio Marcio Da Silva and Lucia Rottava. 2024. Densidade lexical em textos gerados pelo chatgpt: implicações da inteligência artificial para a escrita em línguas adicionais. Texto Livre, 17:e47836. Zhivar Sourati, Alireza S Ziabari, and Morteza De- hghani. 2025. The homogenizing effect of large language models on human expression an...

  21. [21]

    Simple synthetic data reduces sycophancy in large language models

    Testing of detection tools for ai-generated text. International Journal for Educational Integrity, 19(1):1–39. Jerry Wei, Da Huang, Yifeng Lu, Denny Zhou, and Quoc V Le. 2023. Simple synthetic data reduces sycophancy in large language models. arXiv preprint arXiv:2308.03958. Emily Wenger and Y oed Kenett. 2025. We’re different, we’re the same: Creative ho...

  22. [22]

    Do NOT repeat the input text

  23. [23]

    If the input text is machine-formatted or technical data, output an empty string ("") and nothing else

  24. [24]

    Here is the continuation

    Do NOT provide any conversational preface or acknowledgments (e.g., "Here is the continuation...")

  25. [25]

    Output ONLY natural language

  26. [26]

    User prompts

    Output ONLY the continuation text. User prompts. English: Provide a continuation of this English news text, without preamble, continue directly:\n\n[prompt text ] Dutch: Schrijf het vervolg van deze Nederlandse nieuwstekst, zonder inleiding, ga direct door:\n\n[ prompt text] Language-specific user prompts are provided in the GitHub repository. C Diachroni...