From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection

Diego Rossini; Lonneke van der Plas

arxiv: 2605.06426 · v1 · submitted 2026-05-07 · 💻 cs.CL

Pith reviewed 2026-05-08 10:11 UTC · model grok-4.3

classification 💻 cs.CL

keywords neologism detection · lexical innovation · word formation · LLM classification · Reddit corpus · automatic pipeline · morphology frameworks · neologism candidates

The pith

A pipeline reduces 124 million tokens from Reddit posts to 1,021 neologism candidates, 59 percent of which manual review confirms as genuine lexical innovations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a scalable method to find new words across enormous collections of online text without requiring humans to scan everything. It draws on two complementary theories of word formation to create rule-based filters and a four-class scheme that labels candidates as neologisms, entities, foreign words, or none. Applied to 527 million Reddit posts from 2005 to 2024, the system cuts 124.6 million unique tokens by more than 99.99 percent: the rule-based stages leave 174,973 candidates, and majority voting across multiple large language models narrows these to 1,021, every one of which is then annotated by hand. The outcome shows that the pipeline can surface real language change at practical scale.

Core claim

The authors present a modular pipeline for automatic neologism detection that integrates rule-based filtering based on grammatical and extra-grammatical morphology with LLM classification using a four-class scheme. Processing 124.6 million unique tokens from 527 million English Reddit posts yields 1,021 candidates. Majority vote across multiple LLMs followed by manual annotation establishes that 599 candidates, or 58.7 percent, qualify as genuine lexical innovations.
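
The headline figures can be checked with a few lines of arithmetic. The sketch below uses the rounded counts quoted above (124.6 million unique tokens, 1,021 candidates, 599 confirmed); the variable names are illustrative and are not taken from the released code.

```python
# Rounded counts as quoted in the abstract; the exact token count is not given here.
unique_tokens = 124_600_000
candidates = 1_021
confirmed = 599

reduction = 1 - candidates / unique_tokens  # share of unique tokens filtered out
confirmed_share = confirmed / candidates    # share of candidates confirmed by annotation

print(f"reduction: {reduction:.4%}")        # reduction: 99.9992%
print(f"confirmed: {confirmed_share:.1%}")  # confirmed: 58.7%
```

Both values agree with the reported "over 99.99%" reduction and the 58.7% confirmation rate.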

What carries the argument

The modular pipeline that combines rule-based filtering grounded in grammatical and extra-grammatical morphology frameworks with LLM majority-vote classification under a four-class scheme to identify neologisms.
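
A minimal sketch of majority voting over the four labels may help make this step concrete. The label names follow the prompt excerpt quoted in the reference graph; the function name and the fallback to NONE when no label wins a strict majority are assumptions, since the paper's tie-breaking rule is not given on this page.

```python
from collections import Counter

LABELS = {"NEOLOGISM", "ENTITY", "FOREIGN", "NONE"}

def majority_vote(votes, fallback="NONE"):
    """Return the label chosen by a strict majority of models, else the fallback."""
    unknown = set(votes) - LABELS
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    if not votes:
        return fallback
    label, count = Counter(votes).most_common(1)[0]
    # Assumption: without a strict majority, fall back to NONE.
    return label if count > len(votes) / 2 else fallback

print(majority_vote(["NEOLOGISM", "NEOLOGISM", "ENTITY"]))  # NEOLOGISM
print(majority_vote(["NEOLOGISM", "ENTITY", "FOREIGN"]))    # NONE
```

Requiring a strict majority keeps a three-way split from being settled by list order, which is one plausible way to handle the cross-model disagreement the paper reports.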

If this is right

The approach reduces the search space by over 99.99 percent while leaving a set small enough for full manual verification.
Multiple LLMs show substantial disagreement when classifying neologism candidates, highlighting operational challenges at scale.
The released pipeline code and annotated candidate list enable direct reproduction and extension on other corpora.
The method demonstrates that rule-based filters informed by word-formation theory can complement LLM judgments for lexical innovation tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The pipeline could be tested on non-English social media data to check how well the morphology frameworks transfer across languages.
Disagreements among LLMs point to a possible need for fine-tuning models specifically on examples of linguistic novelty.
Periodic re-application of the pipeline to newer Reddit slices might allow construction of a timeline of when specific neologisms first appear.
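
The timeline idea in the last point can be sketched as a first-attestation lookup. Everything here is hypothetical: the function name, the (date, text) input shape, the naive whitespace tokenization, and the toy posts are illustrative assumptions rather than part of the paper's pipeline.

```python
from datetime import date

def first_attestations(posts, candidates):
    """Map each candidate to the earliest date on which it appears in `posts`.

    `posts` is an iterable of (date, text) pairs. Tokenization is a naive
    lowercase whitespace split (an assumption, not the paper's tokenizer).
    """
    wanted = {c.lower() for c in candidates}
    first_seen = {}
    for day, text in posts:
        for token in set(text.lower().split()) & wanted:
            if token not in first_seen or day < first_seen[token]:
                first_seen[token] = day
    return first_seen

# Toy posts, invented for illustration only.
posts = [
    (date(2016, 3, 1), "that youtuber again"),
    (date(2012, 7, 9), "every youtuber does this"),
]
print(first_attestations(posts, ["youtuber"]))
```

Because the function keeps the minimum date per token, the posts do not need to be sorted chronologically.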

Load-bearing premise

That the grammatical and extra-grammatical morphology frameworks together with the four-class scheme provide a sufficient and non-circular operational definition of what counts as a neologism for large-scale detection.

What would settle it

Running the pipeline on a controlled corpus of only pre-existing established words and known neologisms and finding that it either misses most known innovations or outputs many false positives would falsify its claimed effectiveness.
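
Such a test amounts to computing precision and recall against a gold list of known neologisms. The sketch below assumes the pipeline output and the gold list are both available as plain sets of strings; the function name and the toy data are hypothetical.

```python
def precision_recall(flagged, known_neologisms):
    """Compare pipeline output with a gold set of known neologisms."""
    flagged, gold = set(flagged), set(known_neologisms)
    true_pos = len(flagged & gold)
    precision = true_pos / len(flagged) if flagged else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall

# Toy data, invented for illustration only.
flagged = {"youtuber", "trollazzo", "teh"}
gold = {"youtuber", "doomscroll"}
p, r = precision_recall(flagged, gold)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.33 recall=0.50
```

Low recall would correspond to missing most known innovations, and low precision to outputting many false positives.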

Original abstract

We present a scalable, modular pipeline for automatic neologism detection that combines rule-based filtering with LLM classification. The pipeline is grounded in two complementary word-formation frameworks, grammatical and extra-grammatical morphology, which jointly define the scope of what counts as a neologism and inform a four-class classification scheme (neologism, entity, foreign, none). While designed to be modular and transferable at the architectural level, the pipeline is instantiated on 527 million English-language Reddit posts spanning 2005-2024. From this corpus, we extract 124.6 million unique tokens and reduce them by over 99.99% to yield 1,021 neologism candidates, a set small enough for manual expert verification. Multiple LLMs independently classify each candidate via majority vote, with a final verification step, revealing substantial cross-model disagreement and highlighting the challenge of operationalizing neologism detection at scale. Manual annotation of all 1,021 candidates confirms that 599 (58.7%) are genuine lexical innovations. The pipeline code, vocabulary compilation scripts, and the annotated candidate list are available at https://github.com/DiegoRossini/neologism-pipeline.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper gives a workable pipeline that shrinks 124 million Reddit tokens to 1,021 candidates with 59% manual confirmation and open code, but the confirmation step may not be independent of the morphological rules used for filtering.

The main takeaway is a concrete, large-scale method for spotting new words in social media. They start with 527 million Reddit posts, pull 124.6 million unique tokens, apply rule-based filters drawn from grammatical and extra-grammatical morphology, run multiple LLMs with majority vote, and end up with 1,021 candidates that receive full expert annotation, yielding 599 confirmed neologisms at 58.7%. The reduction numbers are explicit, the cross-model disagreement is reported, and the code plus annotated list are released on GitHub. That combination of scale, transparency, and reusability is the real contribution here.

Referee Report

1 major / 2 minor

Summary. The paper presents a modular pipeline for automatic neologism detection that integrates rule-based filtering grounded in grammatical and extra-grammatical morphology frameworks with LLM-based classification. Applied to 527 million Reddit posts (2005–2024), the pipeline reduces 124.6 million unique tokens by over 99.99% to 1,021 candidates. Multiple LLMs classify candidates via majority vote, followed by full manual expert annotation that identifies 599 (58.7%) as genuine lexical innovations. The code, scripts, and annotated list are released openly.

Significance. If the verification process is shown to be independent, the work supplies a reproducible, large-scale method and a sizable annotated dataset for studying lexical innovation in social media. The explicit quantification of reduction steps, use of multiple LLMs, and open release of resources are concrete strengths that support transferability and further research in computational lexicography and NLP.

major comments (1)

[Manual annotation / verification step] Manual annotation description (the section detailing the four-class scheme and expert verification): the paper must specify the exact annotation guidelines provided to experts and whether they incorporate external, independent criteria (e.g., dictionary absence, first-attestation dating outside the corpus, or semantic novelty judged without reference to the morphology frameworks). If annotators apply only the same neologism/entity/foreign/none taxonomy derived from the grammatical/extra-grammatical frameworks used for filtering, the 58.7% figure risks confirming internal consistency rather than external validity of the detected set.

minor comments (2)

[Abstract and §1] The abstract and introduction should more clearly distinguish the pipeline's architectural modularity from the specific morphological frameworks chosen for this instantiation, to help readers assess transferability.
[LLM classification results] Table or figure reporting cross-model disagreement rates would benefit from explicit counts or percentages alongside the qualitative statement of 'substantial disagreement'.
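
The counts requested in the second minor comment could be produced along the following lines. This is a sketch under stated assumptions: each model's labels are stored as a list in the same candidate order, and the model names and toy labels are invented for illustration.

```python
from itertools import combinations

def disagreement_report(labels_by_model):
    """Return pairwise disagreement rates and the number of non-unanimous items.

    `labels_by_model` maps a model name to its list of labels, one per
    candidate, with all lists in the same candidate order.
    """
    lengths = {len(v) for v in labels_by_model.values()}
    if len(lengths) != 1:
        raise ValueError("all models must label the same candidates")
    n = lengths.pop()
    pairwise = {}
    for a, b in combinations(sorted(labels_by_model), 2):
        differing = sum(x != y for x, y in zip(labels_by_model[a], labels_by_model[b]))
        pairwise[(a, b)] = differing / n if n else 0.0
    # Items on which at least two models chose different labels.
    split = sum(len(set(column)) > 1 for column in zip(*labels_by_model.values()))
    return pairwise, split

# Toy labels, invented for illustration only.
toy = {
    "model_a": ["NEOLOGISM", "ENTITY", "NONE", "NONE"],
    "model_b": ["NEOLOGISM", "NONE", "NONE", "FOREIGN"],
    "model_c": ["NEOLOGISM", "ENTITY", "NONE", "NONE"],
}
pairwise, split = disagreement_report(toy)
print(pairwise, split)
```

Reporting both the pairwise rates and the count of non-unanimous items would give readers the explicit numbers behind "substantial disagreement".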

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the positive assessment of our work and the recommendation for minor revision. We address the major comment point by point below.

Referee: [Manual annotation / verification step] Manual annotation description (the section detailing the four-class scheme and expert verification): the paper must specify the exact annotation guidelines provided to experts and whether they incorporate external, independent criteria (e.g., dictionary absence, first-attestation dating outside the corpus, or semantic novelty judged without reference to the morphology frameworks). If annotators apply only the same neologism/entity/foreign/none taxonomy derived from the grammatical/extra-grammatical frameworks used for filtering, the 58.7% figure risks confirming internal consistency rather than external validity of the detected set.

Authors: We agree that the manuscript should provide greater transparency on the annotation process. The revised version will include the exact annotation guidelines in a new appendix. These guidelines present the four-class taxonomy with definitions drawn directly from the grammatical and extra-grammatical morphology frameworks that underpin the entire pipeline. Annotators were instructed to apply this taxonomy to each candidate using their expert linguistic judgment. We acknowledge that this constitutes an application of the same definitional framework used for candidate filtering rather than an entirely separate external validation (such as mandatory dictionary checks or independent first-attestation dating). At the same time, because the annotators are independent experts unaffiliated with the pipeline design, the 58.7% figure reflects the proportion of candidates that qualified as genuine lexical innovations under consistent expert application of the operational definition. We will add a brief discussion clarifying this point and noting that the released annotated list enables future work to compare against dictionary-based or temporally external benchmarks.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline with external manual verification

The paper presents a modular detection pipeline that applies rule-based filtering derived from grammatical and extra-grammatical morphology frameworks to reduce 124.6 million tokens to 1,021 candidates, followed by LLM majority-vote classification and final manual annotation. The headline result (599 genuine neologisms) rests on this external manual verification step rather than any fitted parameters, self-referential equations, or load-bearing self-citations. No derivation chain reduces the output to the input definitions by construction; the frameworks inform candidate selection but the confirmation count is independently annotated. This is a standard empirical pipeline with no mathematical or definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The pipeline rests on established linguistic frameworks without introducing new free parameters or invented entities; the four-class scheme is derived from standard morphological theory.

axioms (1)

Domain assumption: Grammatical and extra-grammatical morphology frameworks jointly define the scope of neologisms and inform the four-class classification scheme.
Invoked to ground the rule-based filtering and LLM classification categories.

Reference graph

Works this paper leans on

16 extracted references · 2 canonical work pages · 1 internal anchor

[1]

From 124 Million Tokens to 1,021 Neologisms: A Large-Scale Pipeline for Automatic Neologism Detection

Introduction. Although the study of neologisms has deep roots in linguistics (Guilbert, 1975; Rey, 1976), their automatic detection is a comparatively recent task. Computational approaches only became feasible once large machine-readable corpora were available in the 1990s (Renouf, 1993; Cabré and de Yzaguirre, 1995). Since then, a number of web-based platfor...

work page internal anchor Pith review Pith/arXiv arXiv 1975
[2]

Related Work. The dominant paradigm for automatic neologism detection remains the exclusion dictionary method: a token is flagged as a candidate neologism if it does not appear in one or more reference lexicons (Renouf, 1993; Cabré and de Yzaguirre, 1995). This principle underpins the major detection platforms developed over the past two decades, includ...

1993
[3]

nonce-formation

Theoretical Foundations. Any neologism detection pipeline presupposes an operational definition of what counts as a neologism. This section presents the two word-formation frameworks that jointly inform the design of our classification scheme and, in particular, determine the scope of the neologism label assigned by the LLM stage (§4). 3.1. Grammatical Word Formation...

1998
[4]

The instantiation described below targets English

Methodology. The pipeline is designed to be modular: the sequence of filtering stages is pre-determined, but the resources each stage operates on (reference vocabularies, phonotactic rules, frequency dictionaries) are language-specific and must be substituted for each target language. The instantiation described below targets English. 4.1. Tokenizati...

2012
[5]

All language-specific resources, parameters, and model choices reported below can be substituted for other languages or corpora

Experimental Setup. This section describes the instantiation of the pipeline for English-language neologism detection on Reddit data. All language-specific resources, parameters, and model choices reported below can be substituted for other languages or corpora. 5.1. Corpus. The corpus consists of Reddit submissions and comments spanning January 2005 to December ...

2005
[6]

Filtering Cascade. Table 2 reports the number of candidate tokens surviving each pipeline stage

Results. 6.1. Filtering Cascade. Table 2 reports the number of candidate tokens surviving each pipeline stage. The rule-based stages reduce the initial 124.6 million unique tokens by 99.86%, yielding 174,973 candidates for LLM classification. The most aggressive single stage is pattern cleaning, which removes 90 million tokens (72.2% of the input at that ...

2013
[7]

90 Words

Discussion. The pipeline is best understood as a high-recall candidate generator rather than a precision classifier. Its primary contribution is the 122,031:1 compression ratio, which reduces a task that no human annotator could feasibly undertake (reviewing 124.6 million tokens) to one that a single annotator can complete (reviewing 1,021 candidates...

2025
[8]

Applied to 527 million Reddit posts, the pipeline achieves a 122,031:1 compression ratio, yielding 1,021 candidates of which 599 (58.7%) are genuine lexical innovations

Conclusion. We presented a scalable pipeline for automatic neologism detection that combines rule-based filtering with multi-model LLM classification, grounded in grammatical and extra-grammatical word-formation theory. Applied to 527 million Reddit posts, the pipeline achieves a 122,031:1 compression ratio, yielding 1,021 candidates of which 599 (58.7%) are gen...

2023
[9]

Ti blocco perché sei un trollazzo

Bibliographical References. Adam Aleksic. 2025. Algospeak: How Social Media Is Transforming the Future of Language. Knopf, New York. Sabine Arndt-Lappe. 2015. Word-formation and analogy. In Peter O. Müller, Ingeborg Ohnheiser, Susan Olsen, and Franz Rainer, editors, Word-Formation: An International Handbook of the Languages of Europe, volume 2, pages 822–841. De Gruy...

2025
[10]

Louis Guilbert

Mapping Lexical Innovation on American Social Media. Journal of English Linguistics, 46(4):293–319. Louis Guilbert. 1975. La créativité lexicale. Langue et Langage. Larousse, Paris. Peter Hohenhaus. 1998. Non-lexicalizability as a characteristic feature of nonce word-formation in English and German. Lexicology, 4(2):237–280. Daphné Kerremans, Jelena Prokić...

1975
[11]

In Lexicography in the Digital Age, pages 559–569

New German words: Detection, description, and dictionary entry. In Lexicography in the Digital Age, pages 559–569. Euralex. Lívia Körtvélyessy, Pavol Štekauer, and Pavol Kačmár. 2021. On the role of creativity in the formation of new complex words. Linguistics, 59(4):1017–1055. Lívia Körtvélyessy, Pavol Štekauer, and Pavol Kačmár. 2022. Creativity in Word ...

2021
[12]

Pavol Štekauer

You can (not) say what you want: Using algospeak to contest and evade algorithmic content moderation on TikTok. Social Media + Society, 9(3). Pavol Štekauer. 2001. Fundamental principles of an onomasiological theory of English word-formation. Onomasiology Online, 2:1–42. Pius ten Hacken and Renáta Panocová, editors. 2020. The Interaction of Borrowing and ...

work page arXiv 2001
[13]

The Pushshift Reddit dataset

Language Resource References. Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit dataset. Wolf Garbe. 2012. SymSpell: Symmetric delete spelling correction algorithm. Princeton University. 2011. WordNet 3.1. Peter M. Stahl. 2022. Lingua: The most accurate natural language detection library for pytho...

2020
[14]

Derived forms are NEOLOGISM (youtuber -> NEOLOGISM, youtube -> ENTITY)
[15]

When uncertain, classify as NONE
[16]

<text>" context_2 (r/<subreddit>):

Use the context and subreddit to understand usage
TOKENS:
TOKEN: <token_1>
context_1 (r/<subreddit>): "<text>"
context_2 (r/<subreddit>): "<text>"
context_3 (r/<subreddit>): "<text>"
TOKEN: <token_2>
context_1 (r/<subreddit>): "<text>"
...
OUTPUT: One classification per line as TOKEN:LABEL (ENTITY, NEOLOGISM, FOREIGN, or NONE). No explanations. Single-tok...

2015
