Lexical Consensus: Grounded Word Learning and Shared Meaning in Artificial Agents

Patricio M. Vera

arxiv: 2606.22207 · v1 · pith:K5STGB5Inew · submitted 2026-06-20 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-26 11:37 UTC · model grok-4.3

keywords: grounded word learning · lexical consensus · perceptual coherence · artificial agents · visual embeddings · nonce words · CIFAR-100 · bidirectional evaluation

The pith

Artificial agents acquire grounded word meanings according to a perceptual-coherence gradient set by visual embedding geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lexical Consensus as a framework for testing whether agents can acquire, generalize, and stabilize new lexical labels from visual experience alone. It pairs frozen DINOv2 embeddings with nonce words and measures learning across native categories, overextensions, and disjunctive concepts. The central finding is that acquisition follows perceptual distance: coherent groupings succeed while far-disjunctive ones fail, and a pre-registered dissociation shows semantic distance adds no predictive power once perceptual distance is accounted for. Bidirectional tests further separate naming accuracy from retrieval fidelity, with exemplar mechanisms outperforming prototypes in one direction. The results indicate that frozen perceptual structure both supports and constrains lexical grounding without further representational change.

Core claim

Agents learn artificial visual labels along a perceptual-coherence gradient: native categories are acquired most readily, coherent overextensions remain learnable, mid-range disjunctives degrade, and far-disjunctive concepts approach chance performance. This gradient is driven by perceptual distance in the embedding space rather than semantic relatedness, as indicated by partial R-squared values of 0.245 (perceptual distance) versus 0.002 (semantic distance) in the CIFAR-100 dissociation test.

What carries the argument

The Lexical Consensus framework uses frozen DINOv2 visual embeddings as the perceptual substrate, Carroll-style nonce words as labels, and bidirectional naming and retrieval tasks with linear and exemplar-based learners to isolate grounded lexical acquisition.
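The contrast between centroid prototypes and exemplar memory can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the class names `CentroidLearner` and `ExemplarLearner`, the Euclidean metric, the nonce labels, and the hand-placed 2-D points (stand-ins for DINOv2 embeddings) are all hypothetical.

```python
import numpy as np


class CentroidLearner:
    """Prototype learner (hypothetical): stores one mean embedding per label."""

    def fit(self, embeddings, labels):
        embeddings, labels = np.asarray(embeddings, dtype=float), np.asarray(labels)
        self.labels_ = sorted(set(labels.tolist()))
        self.centroids_ = np.stack(
            [embeddings[labels == w].mean(axis=0) for w in self.labels_]
        )
        return self

    def name(self, query):
        """Image -> label: pick the label whose centroid is nearest."""
        dists = np.linalg.norm(self.centroids_ - query, axis=1)
        return self.labels_[int(np.argmin(dists))]

    def retrieve(self, label, candidates):
        """Label -> image: index of the candidate nearest the label's centroid."""
        centroid = self.centroids_[self.labels_.index(label)]
        return int(np.argmin(np.linalg.norm(candidates - centroid, axis=1)))


class ExemplarLearner:
    """Exemplar learner (hypothetical): stores every labelled embedding."""

    def fit(self, embeddings, labels):
        self.embeddings_ = np.asarray(embeddings, dtype=float)
        self.labels_ = np.asarray(labels)
        return self

    def name(self, query):
        """Image -> label: copy the label of the nearest stored exemplar."""
        dists = np.linalg.norm(self.embeddings_ - query, axis=1)
        return str(self.labels_[int(np.argmin(dists))])

    def retrieve(self, label, candidates):
        """Label -> image: candidate closest to any stored exemplar of the label."""
        stored = self.embeddings_[self.labels_ == label]
        dists = np.linalg.norm(candidates[:, None, :] - stored[None, :, :], axis=2)
        return int(np.argmin(dists.min(axis=1)))


# Toy 2-D "embeddings" (assumed stand-ins, not DINOv2 output).
# "tove" is a far-disjunctive concept: two clusters on opposite sides of "borogove".
train_x = np.array([[-5, 0], [-5, 1], [5, 0], [5, 1], [0, 1], [1, 1]], dtype=float)
train_y = ["tove", "tove", "tove", "tove", "borogove", "borogove"]

centroid = CentroidLearner().fit(train_x, train_y)
exemplar = ExemplarLearner().fit(train_x, train_y)

query = np.array([5.0, 0.5])                     # a true "tove"
candidates = np.array([[5.0, 0.4], [0.2, 0.6]])  # index 0 is the true "tove"

print(centroid.name(query), exemplar.name(query))
print(centroid.retrieve("tove", candidates), exemplar.retrieve("tove", candidates))
```

In this toy layout the mean of the two "tove" clusters lands in the middle of the space, next to "borogove", so the centroid learner misnames the query and retrieves the wrong candidate, while the exemplar learner gets both directions right. The example is constructed to show the mechanism only; it says nothing about effect sizes in the paper.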

If this is right

Native and perceptually coherent categories reach high acquisition accuracy while far-disjunctive ones remain near chance.
Perceptual distance explains 24.5 percent of the variance in acquisition accuracy that the other predictors leave unexplained (partial R-squared = 0.245).
Semantic distance contributes no detectable additional explanatory power once perceptual distance is included.
Exemplar-based mechanisms outperform centroid prototypes specifically in label-to-image retrieval.
Frozen perceptual geometry enables initial grounding but prevents acquisition of concepts outside its natural clusters without adaptation.
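The partial R-squared figures in the list above compare a regression with and without the predictor of interest. The sketch below shows one conventional way to compute that quantity with ordinary least squares. The function names `sse` and `partial_r2` are hypothetical, and the data are synthetic values generated purely for illustration; they are not the paper's measurements, and the paper's exact model specification is not given in the abstract.

```python
import numpy as np


def sse(X, y):
    """Sum of squared residuals of an OLS fit of y on X plus an intercept."""
    design = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def partial_r2(x_other, x_target, y):
    """Share of the variance left unexplained by x_other that x_target explains."""
    sse_reduced = sse(x_other, y)                              # model without x_target
    sse_full = sse(np.column_stack([x_other, x_target]), y)    # model with x_target
    return (sse_reduced - sse_full) / sse_reduced


# Synthetic example (assumption): accuracy depends on perceptual distance only.
rng = np.random.default_rng(0)
perceptual = rng.uniform(size=200)
semantic = rng.uniform(size=200)
accuracy = 0.9 - 0.4 * perceptual + rng.normal(scale=0.1, size=200)

pr2_perceptual = partial_r2(semantic, perceptual, accuracy)
pr2_semantic = partial_r2(perceptual, semantic, accuracy)
print(f"partial R^2, perceptual: {pr2_perceptual:.3f}")
print(f"partial R^2, semantic:   {pr2_semantic:.3f}")
```

Because the synthetic accuracy is built from perceptual distance alone, the first value comes out large and the second close to zero, which is the qualitative pattern the paper reports.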

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Multi-agent systems would need aligned perceptual geometries to reach stable shared lexicons on the same visual inputs.
The framework could be used to test whether fine-tuning the embedding layer allows agents to learn previously unlearnable disjunctive concepts.
Current large vision-language models may inherit similar limits on which novel word meanings they can ground without updating their visual representations.

Load-bearing premise

Frozen DINOv2 embeddings supply a fixed, representative geometry for the tested visual concepts so that observed learning differences reflect properties of that geometry rather than model-specific artifacts.

What would settle it

Re-running the dissociation experiment with a different visual embedding model on the same CIFAR-100 concepts and finding that semantic distance then predicts accuracy better than perceptual distance would falsify the claim that the gradient is governed by perceptual geometry.

Figures

Figures reproduced from arXiv: 2606.22207 by Patricio M. Vera.

**Figure 2.** C1 naming accuracy follows a perceptual-coherence gradient. Native categories are …

**Figure 3.** C1 naming accuracy by dissociation quadrant. Pairs are classified by whether their …

**Figure 4.** C2 exemplar-over-centroid retrieval gap across concept tiers and candidate-pool con…

**Figure 5.** C2 retrieval accuracy across homogeneous candidate-pool constructions. Each pool type …

**Figure 6.** PCA projection of the frozen DINOv2-small embedding space for the initial visual cate…

**Figure 7.** Single-agent confusion matrices across seed sizes. The mapping saturates after 10 exam…

**Figure 8.** Grounding-control confusion matrices for C1. Performance degrades or collapses when …

**Figure 9.** Entropy curve from the information-theoretic analysis. The curve provides an additional …

**Figure 10.** Alignment gain by label in the passive centroid-alignment experiment. Gains remain …

**Figure 11.** PCA visualization of cluster centroids in the regional-divergence experiment.

**Figure 12.** Partial-regression diagnostic for the CIFAR-100 dissociation experiment. After control…

Original abstract

Artificial intelligence systems are commonly evaluated through task performance and behavioral imitation, but such evaluations leave open whether an artificial agent can acquire, stabilize, and use new lexical meanings from grounded experience. This paper introduces Lexical Consensus, an experimental framework for studying grounded word learning over a structured perceptual substrate. Using frozen DINOv2 visual embeddings, Carroll-style nonce words, and interpretable lexical learners plus linear baselines, we test whether agents can acquire artificial labels for visual concepts, generalize them bidirectionally, and stabilize them across controlled settings. The main result is a robust perceptual-coherence gradient: native categories are easiest to learn, coherent overextensions remain learnable, mid-range disjunctive concepts degrade, and far-disjunctive concepts approach chance. A pre-registered CIFAR-100 dissociation experiment confirms that this gradient is governed by perceptual distance rather than semantic relatedness: perceptual distance predicts acquisition accuracy (partial R^2 = 0.245, p < 1e-7), while semantic distance adds no significant explanatory power (partial R^2 = 0.002, p = 0.660). Bidirectional evaluation shows that naming and retrieval are distinct: exemplar-based mechanisms outperform centroid prototypes in label-to-image retrieval, exposing a memory-fidelity dimension separate from naming accuracy. Falsification controls, homogeneous candidate-pool evaluations, and null results on representational restructuring indicate that frozen perceptual geometry both enables lexical grounding and limits what can be acquired without representational adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a clean experimental demonstration that perceptual distance in frozen DINOv2 space predicts nonce-word acquisition accuracy in agents while semantic distance adds nothing, backed by a pre-registered dissociation.

The main takeaway is that this work supplies a controlled way to measure how visual geometry shapes lexical acquisition in artificial agents, and the dissociation experiment supports the claim that perceptual distance matters far more than semantic relatedness.

What is new is the Lexical Consensus framework, the perceptual-coherence gradient across native, overextended, and disjunctive concepts, and the pre-registered CIFAR-100 study that reports partial R-squared values separating the two distance types. The bidirectional naming-versus-retrieval evaluation and the use of Carroll-style nonce words with linear baselines are also fresh moves in this area.

The paper does the empirical part well. The gradient holds up, the perceptual predictor reaches partial R^2 = 0.245 with p < 1e-7 while semantic distance is flat at 0.002 and p = 0.660, and the falsification controls plus homogeneous pool checks add credibility. The distinction between exemplar and prototype mechanisms in retrieval is a useful extra finding.

The soft spot is the heavy dependence on frozen DINOv2 embeddings as the perceptual geometry. If those embeddings already carry correlations with CIFAR-100 super-categories, the dissociation could partly reflect how the two distance matrices were built rather than a pure perceptual-versus-semantic contrast. Stability across label-assignment choices is not obviously demonstrated in the abstract-level description, so that assumption carries more weight than the paper acknowledges.

This is for people working on grounded language learning and vision-language agents. A reader who wants falsifiable tests of how perceptual structure constrains lexical grounding will find usable results here. It deserves a serious referee because the pre-registration and statistical separation give the central claim something concrete to evaluate.

Referee Report

1 major / 2 minor

Summary. The paper introduces the Lexical Consensus framework for studying grounded word learning in artificial agents. Using frozen DINOv2 visual embeddings, nonce words, and lexical learners, it reports a perceptual-coherence gradient (native categories easiest, far-disjunctive near chance) and a pre-registered CIFAR-100 dissociation experiment in which perceptual distance predicts acquisition accuracy (partial R² = 0.245, p < 1e-7) while semantic distance adds none (partial R² = 0.002, p = 0.660). The work concludes that frozen perceptual geometry both enables and limits lexical grounding without representational adaptation, supported by bidirectional naming/retrieval tests and falsification controls.

Significance. If the results hold, the pre-registered dissociation experiment with partial R-squared values and p-values, together with falsification controls and null results on restructuring, supplies direct statistical evidence that perceptual structure governs lexical acquisition independently of semantic relatedness. This strengthens empirical grounding for claims about the enabling and limiting role of frozen perceptual geometries in artificial lexical learning.

major comments (1)

[CIFAR-100 dissociation experiment] The dissociation result and the claim that frozen DINOv2 geometry governs the gradient (abstract and dissociation experiment section) rest on the assumption that DINOv2 distances provide an unconfounded perceptual measure. Because DINOv2 is pre-trained on large-scale image data that may correlate with CIFAR-100 super-categories, the partial-R² contrast could arise from how the two distance matrices are constructed rather than from genuine perceptual vs semantic separation; an explicit orthogonality test or replication with alternative embeddings is required to secure this load-bearing step.

minor comments (2)

[Abstract] The abstract supplies only high-level method descriptions without full protocols, data splits, or error-bar details, which limits immediate reproducibility assessment.
[Methods] Clarify the exact operationalization of perceptual and semantic distance matrices and the regression model specification used for the partial R² calculations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the dissociation experiment. The concern about potential confounds in the DINOv2 distance measure is substantive and directly relevant to the load-bearing claim. We address it point-by-point below and commit to revisions that strengthen the result without altering the core findings.

Referee: [CIFAR-100 dissociation experiment] The dissociation result and the claim that frozen DINOv2 geometry governs the gradient (abstract and dissociation experiment section) rest on the assumption that DINOv2 distances provide an unconfounded perceptual measure. Because DINOv2 is pre-trained on large-scale image data that may correlate with CIFAR-100 super-categories, the partial-R² contrast could arise from how the two distance matrices are constructed rather than from genuine perceptual vs semantic separation; an explicit orthogonality test or replication with alternative embeddings is required to secure this load-bearing step.

Authors: We agree that an explicit check on the independence of the two distance matrices is necessary to rule out construction artifacts. DINOv2 is trained self-supervised on unlabeled images, so its geometry reflects visual feature similarity rather than category labels; the semantic distance matrix is derived separately from WordNet super-category structure. Nevertheless, to directly test whether the partial-R² dissociation could be an artifact of matrix construction, we will add an orthogonality analysis in the revised manuscript: we will report the Pearson correlation between the DINOv2 perceptual distance matrix and the semantic distance matrix across the CIFAR-100 stimuli. A low correlation would confirm that the two predictors are not redundant and that the unique variance captured by perceptual distance is genuine. Should the correlation prove substantial, we will discuss the implication and, if needed, replicate the dissociation using an alternative visual embedding (e.g., a ResNet trained on a disjoint dataset). This addition will be placed in the dissociation experiment section and referenced in the abstract.

revision: yes
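The orthogonality analysis promised in the response can be sketched as follows. This is a minimal illustration, assuming both distance matrices are symmetric with the same item ordering. The helper names are hypothetical, and the matrices are generated from random points rather than from DINOv2 or any semantic hierarchy. Entries of a distance matrix are not independent, so the sketch pairs the plain Pearson correlation with a Mantel-style permutation test. That test is a common choice for this situation, not something the response commits to.

```python
import numpy as np


def upper_triangle(d):
    """Off-diagonal upper-triangle entries of a square distance matrix."""
    i, j = np.triu_indices(d.shape[0], k=1)
    return d[i, j]


def matrix_pearson(d_a, d_b):
    """Pearson correlation between the pairwise distances in two matrices."""
    return float(np.corrcoef(upper_triangle(d_a), upper_triangle(d_b))[0, 1])


def mantel_p(d_a, d_b, n_perm=999, seed=0):
    """Two-sided permutation p-value: shuffle item order of d_a, recompute r."""
    rng = np.random.default_rng(seed)
    observed = abs(matrix_pearson(d_a, d_b))
    n = d_a.shape[0]
    hits = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permute rows and columns together so d_a stays a valid distance matrix.
        if abs(matrix_pearson(d_a[np.ix_(p, p)], d_b)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def pairwise_distances(points):
    return np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)


# Synthetic stand-ins (assumption): two unrelated point clouds over 20 items.
rng = np.random.default_rng(1)
d_perceptual = pairwise_distances(rng.normal(size=(20, 8)))
d_semantic = pairwise_distances(rng.normal(size=(20, 3)))

r = matrix_pearson(d_perceptual, d_semantic)
p_value = mantel_p(d_perceptual, d_semantic)
print(f"r = {r:.3f}, permutation p = {p_value:.3f}")
```

A correlation near zero would support treating the two predictors as non-redundant; a large one would call for the replication with alternative embeddings that the response mentions.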

Circularity Check

0 steps flagged

No circularity; central claims are direct empirical measurements from pre-registered tests

The paper reports experimental results on lexical acquisition using frozen DINOv2 embeddings, nonce labels, and statistical dissociation (partial R^2 values and p-values) between perceptual and semantic distances on CIFAR-100. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps; the perceptual-coherence gradient and bidirectional naming/retrieval findings are measured outcomes rather than quantities defined in terms of themselves or prior author work. The framework is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The reported results rest on pre-existing visual embedding models and standard statistical procedures for regression and significance testing; no new free parameters, axioms beyond ordinary statistical assumptions, or invented entities are introduced in the abstract.

axioms (1)

[standard math] Standard assumptions for partial R-squared calculations and p-value interpretation in multiple regression hold for the dissociation analysis.
Invoked when reporting partial R^2 = 0.245 for perceptual distance and partial R^2 = 0.002 for semantic distance.
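For reference, the textbook definition behind this axiom can be written out. The notation is assumed here rather than taken from the paper: SSE_full is the residual sum of squares of the model with all p predictors, SSE_reduced is that of the model with predictor x_j removed, and n is the number of observations.

```latex
R^2_{\text{partial}}(x_j)
  = \frac{\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}}}
         {\mathrm{SSE}_{\text{reduced}}},
\qquad
F = \frac{R^2_{\text{partial}}(x_j)}{1 - R^2_{\text{partial}}(x_j)}\,(n - p - 1)
```

Under the usual linear-model assumptions (independent, homoscedastic, normally distributed errors), F follows an F distribution with 1 and n - p - 1 degrees of freedom when x_j has no effect, which is what licenses the p-value interpretation the axiom refers to.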

Reference graph

Works this paper leans on

24 extracted references · 4 canonical work pages

[1] Rethinking the Evaluating Framework for Natural Language Understanding in AI Systems: Language Acquisition as a Core for Future Metrics. 2023.
[2] Through the Looking-Glass, and What Alice Found There. 1871.
[3] Oquab, Maxime and Darcet, Timoth… 2023.
[4] Computing Machinery and Intelligence. Mind, 1950.
[5] Evaluation in Artificial Intelligence: From Task-Oriented to Ability-Oriented Measurement. Artificial Intelligence Review, 2017.
[6] On the Measure of Intelligence. arXiv preprint arXiv:1911.01547.
[7] The Symbol Grounding Problem. Physica D: Nonlinear Phenomena, 1990.
[8] Bender, Emily M. and Koller, Alexander. Climbing towards … 2020. doi:10.18653/v1/2020.acl-main.463
[9] Experience Grounds Language. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020. doi:10.18653/v1/2020.emnlp-main.703
[10] Convention: A Philosophical Study. 1969.
[11] A Self-Organizing Spatial Vocabulary. Artificial Life, 1995.
[12] The Talking Heads Experiment: Origins of Words and Meanings. 2015. doi:10.17169/FUDOCS_document_000000022455
[13] Sharp Transition towards Shared Vocabularies in Multi-Agent Systems. Journal of Statistical Mechanics: Theory and Experiment, 2006.
[14] Multi-Agent Cooperation and the Emergence of (Natural) Language. 2016.
[15] Emergence of Linguistic Communication from Referential Games with Symbolic and Pixel Input. International Conference on Learning Representations. arXiv:1804.03984.
[16] Emergence of Language with Multi-Agent Games: Learning to Communicate with Sequences of Symbols. Advances in Neural Information Processing Systems, 2017.
[17] Emergence of Grounded Compositional Language in Multi-Agent Populations. Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[18] Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017. doi:10.18653/v1/D17-1321
[19] Similarity of Neural Network Representations Revisited. Proceedings of the 36th International Conference on Machine Learning, 2019.
[20] How Agents See Things: On Visual Representations in an Emergent Language Game. arXiv preprint arXiv:1808.10696.
[21] Matching Networks for One Shot Learning. Advances in Neural Information Processing Systems, 2016.
[22] Prototypical Networks for Few-shot Learning. Advances in Neural Information Processing Systems, 2017.
[23] Learning Transferable Visual Models From Natural Language Supervision. Proceedings of the 38th International Conference on Machine Learning, 2021.
[24] Learning multiple layers of features from tiny images. 2009.
