arxiv: 2606.17299 · v1 · pith:QPM6OPMDnew · submitted 2026-06-15 · 💻 cs.CL

Examining the Limits of Word2Vec with Toki Pona

Pith reviewed 2026-06-27 02:58 UTC · model grok-4.3

keywords Word2Vec · Toki Pona · word embeddings · distributional semantics · low-resource languages · vocabulary size · corpus noise · semantic clustering

The pith

Word2Vec captures semantic structure in Toki Pona's roughly 130-word lexicon when trained on a large corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests Word2Vec on Toki Pona, a constructed language limited to roughly 130 words, to determine whether embeddings can still reflect semantic relationships. The authors gathered 1.4 million sentences totaling 7.95 million tokens and trained two versions of the model: one keeping the incidental non-Toki Pona tokens (named entities, loanwords, and neologisms, present in about 23% of sentences) and one removing them. They evaluated the models through word proximity to category centroids, silhouette scores from agglomerative clustering, and representational similarity matrices compared against English embeddings. The results indicate that the model succeeds by relying on distributional patterns in usage rather than requiring a large vocabulary, and that the sparse extra tokens tighten clusters without altering the overall structure. The setup directly probes the lower limit of what word embeddings need in order to function.
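The two training conditions can be illustrated with a minimal sketch of how the corpus variants might be prepared. Everything here is an assumption for illustration: `CORE_WORDS` is a tiny hypothetical stand-in for the real lexicon, the tokenizer is deliberately naive, and "filtering" is taken to mean dropping individual non-core tokens (the paper may instead drop whole sentences).

```python
# Hypothetical stand-in for the core lexicon; the real list has roughly 130 words.
CORE_WORDS = {"mi", "sina", "li", "e", "pona", "moku", "tomo", "jan", "lon", "tawa", "ma"}


def tokenize(sentence):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(".,!?:;\"'()") for tok in sentence.lower().split())
    return [tok for tok in tokens if tok]


def build_corpora(sentences, core_words=CORE_WORDS):
    """Return (retained, filtered) lists of token lists.

    Assumption: filtering removes individual non-core tokens and keeps
    the rest of the sentence.
    """
    retained, filtered = [], []
    for sentence in sentences:
        tokens = tokenize(sentence)
        if not tokens:
            continue
        retained.append(tokens)
        core_only = [tok for tok in tokens if tok in core_words]
        if core_only:
            filtered.append(core_only)
    return retained, filtered


def share_with_noncore(corpus, core_words=CORE_WORDS):
    """Fraction of sentences that contain at least one non-core token."""
    if not corpus:
        return 0.0
    hits = sum(any(tok not in core_words for tok in sent) for sent in corpus)
    return hits / len(corpus)


# Toy sentences; two of the four contain a proper name.
sentences = [
    "mi moku.",
    "jan Sonja li pona.",
    "sina tawa tomo.",
    "mi lon ma Kanata.",
]
retained, filtered = build_corpora(sentences)
print(retained[1], filtered[1])
print(share_with_noncore(retained))
```

Each returned corpus is a list of token lists, which is the input shape most Word2Vec implementations accept, so the same training call could be run once per variant.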

Core claim

Word2Vec successfully generates embeddings that capture semantic relationships in Toki Pona despite its extreme vocabulary reduction to approximately 130 words. Training on 1.4 million sentences reveals that effectiveness stems primarily from distributional patterns in the corpus rather than lexicon size. Retaining incidental tokens such as named entities and loanwords draws similar words closer together in vector space while leaving the relative embedding structure intact, as confirmed by centroid proximity measures, agglomerative clustering silhouette scores, and similarity matrices aligned with English.

What carries the argument

Two parallel Word2Vec models trained on the same Toki Pona corpus, one retaining and one filtering non-core tokens, each evaluated via proximity to semantic category centroids, silhouette scores from agglomerative clustering, and representational similarity matrices compared against English embeddings.
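As a rough illustration of the first metric, the sketch below computes how close each word sits to the centroid of its category. The category names, member words, and three-dimensional vectors are hypothetical toy values, and the choice of cosine similarity and the leave-one-out variant are assumptions rather than details confirmed by the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def centroid_proximity(embeddings, categories):
    """Mean cosine similarity of each category's members to its centroid."""
    scores = {}
    for name, words in categories.items():
        vecs = [embeddings[w] for w in words if w in embeddings]
        if not vecs:
            continue
        center = centroid(vecs)
        scores[name] = sum(cosine(v, center) for v in vecs) / len(vecs)
    return scores


def nearest_category(word, embeddings, categories):
    """Category whose centroid is closest to `word`.

    The word itself is left out of every centroid so that it cannot
    pull its own category toward itself.
    """
    best_name, best_sim = None, -2.0
    for name, words in categories.items():
        vecs = [embeddings[w] for w in words if w != word and w in embeddings]
        if not vecs:
            continue
        sim = cosine(embeddings[word], centroid(vecs))
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name


# Hypothetical toy embeddings and categories, for illustration only.
embeddings = {
    "moku": [0.9, 0.1, 0.0],
    "telo": [0.8, 0.2, 0.1],
    "kili": [0.85, 0.05, 0.1],
    "tomo": [0.1, 0.9, 0.1],
    "ma": [0.0, 0.8, 0.3],
    "nasin": [0.2, 0.85, 0.0],
}
categories = {"food": ["moku", "telo", "kili"], "place": ["tomo", "ma", "nasin"]}

scores = centroid_proximity(embeddings, categories)
print(scores)
print(nearest_category("kili", embeddings, categories))
```

Running the same functions on the retained-token and filtered-token models would give one proximity score per category and model, which is the kind of comparison the summary describes.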

If this is right

  • Embeddings remain stable in relative structure when sparse non-core tokens are retained in training data.
  • Incidental tokens improve the tightness of similar-word clusters without harming overall organization.
  • Semantic relationships in embeddings arise chiefly from co-occurrence statistics rather than total unique word count.
  • Word2Vec can be applied to other minimal-vocabulary constructed or low-resource languages given adequate text volume.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The findings point toward testing the same approach on other constructed languages or pidgins to see if corpus size consistently overrides lexicon limits.
  • One could examine whether the observed benefit from incidental tokens holds when the extra items are systematically varied in type or frequency.
  • This setup invites direct comparison with other embedding methods to check if the pattern is specific to Word2Vec or general to distributional models.
  • Results suggest that data collection efforts for small languages should prioritize volume of usage examples over vocabulary expansion.

Load-bearing premise

The chosen quantitative metrics of centroid proximity, silhouette scores, and similarity matrices to English measure genuine semantic capture instead of artifacts from the small vocabulary or mixed corpus composition.
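To make the premise concrete, here is a minimal sketch of the silhouette measurement on toy data. The average-linkage rule, the cosine distance, and the six hypothetical vectors are assumptions; the source does not specify the linkage or metric. The last line scores a deliberately mixed labeling to show that the number reflects only how well labels match the geometry of the space, not whether that geometry is semantic.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def agglomerative(vectors, n_clusters):
    """Average-linkage agglomerative clustering; returns one label per vector."""
    clusters = [[i] for i in range(len(vectors))]
    dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                avg = sum(pairs) / len(pairs)
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    labels = [0] * len(vectors)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    n = len(vectors)
    total = 0.0
    for i in range(n):
        by_cluster = {}
        for j in range(n):
            if j != i:
                d = cosine_distance(vectors[i], vectors[j])
                by_cluster.setdefault(labels[j], []).append(d)
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(sum(d) / len(d) for d in by_cluster.values())  # nearest other cluster
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / n


# Hypothetical toy vectors: three "food-like" and three "place-like" words.
vectors = [
    [0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.85, 0.05, 0.1],
    [0.1, 0.9, 0.1], [0.0, 0.8, 0.3], [0.2, 0.85, 0.0],
]
labels = agglomerative(vectors, n_clusters=2)
score = silhouette_score(vectors, labels)
print(labels, round(score, 3))
print(round(silhouette_score(vectors, [0, 1, 0, 1, 0, 1]), 3))
```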

What would settle it

Finding that the similarity matrices or cluster structures for the Toki Pona embeddings differ markedly from English patterns in a manner attributable to vocabulary size alone, or that removing incidental tokens produces substantially worse clustering scores.
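One hypothetical way to quantify such a difference is to correlate the two representational similarity matrices. The abstract describes the matrix comparison as qualitative, so the Pearson correlation below is an added assumption, as are the word pairings and the toy vectors. Because each matrix holds only pairwise similarities, the two embedding spaces may have different dimensionality.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rsm(items, embeddings):
    """Pairwise cosine-similarity matrix over `items`, in the given order."""
    return [[cosine(embeddings[a], embeddings[b]) for b in items] for a in items]


def upper_triangle(matrix):
    """Entries above the diagonal, row by row (the diagonal is always 1)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical toy embeddings; the pairing of words is illustrative only.
toki_pona = {
    "moku": [0.9, 0.1, 0.0],
    "telo": [0.8, 0.2, 0.1],
    "tomo": [0.1, 0.9, 0.1],
    "ma": [0.0, 0.8, 0.3],
}
english = {
    "food": [1.0, 0.2],
    "water": [0.9, 0.3],
    "house": [0.2, 1.0],
    "land": [0.1, 0.9],
}
pairs = [("moku", "food"), ("telo", "water"), ("tomo", "house"), ("ma", "land")]

rsm_tp = rsm([tp for tp, _ in pairs], toki_pona)
rsm_en = rsm([en for _, en in pairs], english)
r = pearson(upper_triangle(rsm_tp), upper_triangle(rsm_en))
print(round(r, 3))
```

A value near 1 would mean the two spaces order word pairs similarly; a markedly lower value on real data is the kind of divergence this section describes.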

Figures

Figures reproduced from arXiv: 2606.17299 by Daniel Zhenhan Huang, Hongchen Wu.

Figure 1: Distribution of non-Toki Pona tokens. Named ...
Figure 2: Full-vocabulary semantic space comparison. Models are projected by UMAP with the cosine metric and ...
Figure 3: Representational similarity matrices with a subset of Toki Pona categories and corresponding English ...
Figure 4: Complete representational similarity matrices comparing the semantic structure of all 27 defined categories ...

Original abstract

Word2Vec's effectiveness at generating semantic embeddings has been widely validated, yet it has been tested almost exclusively on languages with large vocabulary inventories. This study examines whether Word2Vec can successfully capture semantic relationships within an extremely reduced vocabulary using data from Toki Pona, a constructed language with approximately 130 words. We sourced 1.4 million sentences (7.95 million tokens) from the Toki Pona community for training. Approximately 23% of sentences in the corpus contain non-Toki Pona tokens such as named entities, loanwords, and neologisms. To investigate whether this linguistic noise enhances or hinders performance -- a topic rarely addressed in word embedding literature -- we trained two distinct models: one retaining these incidental tokens and another filtering them out completely. Evaluation was conducted using quantitative methods measuring word proximity to semantic category centroids, automated silhouette scores via agglomerative clustering, and qualitative analysis utilizing representational similarity matrices compared against English. The results indicate that while sparse, non-core tokens do not affect the relative structure of the learned embeddings, they actually draw similar words closer together in the vector space. Importantly, Word2Vec's effectiveness depends more on distributional patterns than lexicon size even at this extreme lower bound.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript trains Word2Vec on a corpus of 1.4 million Toki Pona sentences (7.95 million tokens) drawn from community sources to test whether semantic embeddings can be learned from an extremely small ~130-word lexicon. Two models are compared (one retaining incidental non-Toki Pona tokens, one filtering them), and evaluation uses three methods: proximity of words to semantic category centroids, silhouette scores from agglomerative clustering, and representational similarity matrices versus English embeddings. The central claim is that Word2Vec captures semantic relationships via distributional patterns even at this extreme lower bound, that incidental tokens do not disrupt relative structure (and may draw similar words closer), and that lexicon size is less important than distributional statistics.

Significance. If the quantitative evaluations are shown to measure semantics rather than corpus artifacts and the numerical results support the claims, the work would provide evidence that Word2Vec remains effective under extreme vocabulary reduction, using Toki Pona's deliberately broad senses as a strong test case. This could clarify the minimal conditions for distributional semantics and the robustness of embedding methods to lexicon size.

major comments (3)
  1. [Abstract] The abstract states results from training and three evaluation approaches but provides no numerical values, error bars, statistical tests, or details on how category centroids were defined; the central claim therefore rests on unshown quantitative support.
  2. [Evaluation methods] Proximity to semantic category centroids presupposes independently defined categories whose validity is not cross-checked against external criteria; with Toki Pona's broad, overlapping senses, this risks measuring usage regularities alone rather than independently attested semantic structure.
  3. [Evaluation methods] Silhouette scores from agglomerative clustering on the embeddings themselves quantify the clusterability of the learned space, which any sufficiently consistent co-occurrence model would produce, without establishing alignment with ground-truth semantics.
minor comments (2)
  1. [Abstract] The noisy-vs-clean ablation shows only that incidental tokens do not disrupt relative structure; it does not test whether that structure encodes meaning beyond frequency patterns.
  2. [Evaluation methods] The representational similarity matrices versus English compare two embedding spaces without establishing that either aligns with ground-truth semantics.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our evaluation approach. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

  1. Referee: [Abstract] The abstract states results from training and three evaluation approaches but provides no numerical values, error bars, statistical tests, or details on how category centroids were defined; the central claim therefore rests on unshown quantitative support.

    Authors: We agree that the abstract would be strengthened by including key numerical results. In the revised version, we will add specific values such as mean distances to category centroids, silhouette scores with standard deviations, and details on the definition of the 27 semantic categories (drawn from the official Toki Pona dictionary). Where appropriate, we will report statistical tests comparing the two models. revision: yes

  2. Referee: [Evaluation methods] Proximity to semantic category centroids presupposes independently defined categories whose validity is not cross-checked against external criteria; with Toki Pona's broad, overlapping senses, this risks measuring usage regularities alone rather than independently attested semantic structure.

    Authors: The categories were constructed from the canonical Toki Pona word list and documented community usage to reflect the language's deliberately broad senses. To address the concern about external validation, we will add a mapping of these categories to English semantic equivalents and report alignment with resources such as WordNet synsets. This provides an independent check while preserving the focus on Toki Pona's reduced lexicon. revision: partial

  3. Referee: [Evaluation methods] Silhouette scores from agglomerative clustering on the embeddings themselves quantify the clusterability of the learned space, which any sufficiently consistent co-occurrence model would produce, without establishing alignment with ground-truth semantics.

    Authors: We acknowledge that silhouette scores primarily assess internal structure. In the manuscript, these scores are interpreted in conjunction with the predefined semantic categories and the representational similarity analysis against English embeddings, which serves as an external semantic reference. We will revise the methods and discussion sections to explicitly state how the three evaluation approaches are combined to link clusterability to semantic alignment rather than generic co-occurrence patterns. revision: partial

Circularity Check

0 steps flagged

No circularity; purely empirical training and evaluation

The paper trains Word2Vec models on a Toki Pona corpus (1.4 million sentences) and evaluates them using standard metrics: centroid proximity, silhouette scores from agglomerative clustering, and representational similarity matrices compared against English. No equations, fitted parameters renamed as predictions, self-citations, or derivations are present. All reported outcomes follow directly from corpus statistics and off-the-shelf embedding training, and no claim reduces to its inputs by construction. The evaluation does not depend on results defined within the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms or invented entities; the work relies on standard Word2Vec training assumptions and the validity of the three evaluation procedures, none of which are derived in the abstract.
