Modeling Human-Like Color Naming Behavior in Context
Pith reviewed 2026-05-12 00:55 UTC · model grok-4.3
The pith
Combining moderate upsampling of rare color terms with multiple listeners in neural agent training produces color naming lexicons most similar to those of humans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors add two factors to the NeLLCom-Lex framework: upsampling of rare color terms during supervised learning from human data, and multi-listener reinforcement learning in referential games. They find that moderate upsampling boosts lexical diversity and system-level informativeness, that multiple listeners promote more convex color categories, and that combining the two factors produces the lexicons most similar to human systems.
What carries the argument
Upsampling of rare color terms during supervised learning combined with multi-listener reinforcement learning in referential games, evaluated using a convexity metric on regions in color space.
If this is right
- Upsampling rare terms during supervised learning increases lexical diversity and informativeness of the resulting color lexicon.
- Multi-listener reinforcement learning interactions promote more convex color categories than single-listener baselines.
- The combination of moderate upsampling and multiple listeners minimizes systematic divergence from human color naming systems.
- These adjustments address the non-convex regions that appeared in earlier neural agent color lexicons.
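The upsampling step named in the points above can be illustrated with a minimal sketch. It assumes the supervised data is a list of (color, term) pairs and that "rare" means a term occurring fewer than `rare_below` times; the function name, the threshold, and the replication `factor` are hypothetical choices for illustration, not the authors' procedure.

```python
from collections import Counter

def upsample_rare_terms(examples, rare_below, factor):
    """Replicate training examples whose term is rare.

    examples: list of (color, term) pairs (assumed data layout).
    rare_below: a term counts as rare if it occurs fewer times than this (assumption).
    factor: how many copies of each rare example to keep (assumption).
    """
    counts = Counter(term for _, term in examples)
    upsampled = []
    for color, term in examples:
        copies = factor if counts[term] < rare_below else 1
        upsampled.extend([(color, term)] * copies)
    return upsampled

# Toy data: "blue" is frequent, "teal" is rare.
data = [((0, 0, 255), "blue")] * 5 + [((0, 128, 128), "teal")]
balanced = upsample_rare_terms(data, rare_below=3, factor=3)
term_counts = Counter(term for _, term in balanced)
```

A moderate setting would correspond to a small `factor`: large enough that rare terms are seen during training, small enough that the human frequency distribution is not flattened.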
Where Pith is reading between the lines
- Human color categories may arise in part from repeated exposure to uncommon shades and from communication with varied interlocutors.
- The convexity metric could be tested on category formation in other perceptual domains such as shapes or sounds.
- Varying the number of listeners in agent models might simulate effects of different social group sizes on language structure.
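One way to picture the multi-listener setting is a referential-game round in which several listeners each guess the target from the speaker's term. The sketch below is an assumption-laden toy: listeners are plain lookup tables rather than neural agents, and the speaker's reward is taken to be the mean success across listeners. The page does not say how listener feedback is aggregated, so both `listener_guess` and `multi_listener_reward` are hypothetical.

```python
import random

def listener_guess(term, candidates, lexicon, rng):
    """Return the index of a candidate this listener associates with `term`.

    Falls back to a uniform guess when no candidate matches (assumption).
    """
    matches = [i for i, c in enumerate(candidates) if lexicon.get(c) == term]
    pool = matches or list(range(len(candidates)))
    return rng.choice(pool)

def multi_listener_reward(term, target_idx, candidates, listener_lexicons, rng):
    """Mean success over listeners; averaging is an illustrative choice."""
    hits = sum(
        listener_guess(term, candidates, lex, rng) == target_idx
        for lex in listener_lexicons
    )
    return hits / len(listener_lexicons)

rng = random.Random(0)
candidates = ["chip_1", "chip_2"]
listeners = [
    {"chip_1": "red", "chip_2": "blue"},   # agrees with the speaker
    {"chip_1": "blue", "chip_2": "red"},   # uses the terms the other way round
    {"chip_1": "red", "chip_2": "blue"},   # agrees with the speaker
]
reward = multi_listener_reward("red", 0, candidates, listeners, rng)
```

Under this toy reward, a term is only fully rewarded when every listener interprets it the same way, which is one intuition for why more listeners could push a speaker toward more regular categories.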
Load-bearing premise
That referential games and the convexity metric capture the real-world pressures shaping human color naming behavior.
What would settle it
Human color naming data collected under controlled conditions that shows category convexity levels diverging from those produced by the moderate-upsampling multi-listener model, or that removing upsampling fails to reduce similarity to human lexicons.
Original abstract
Modeling the emergence of human-like lexicons in computational systems has advanced through the use of interacting neural agents, which simulate both learning and communicative pressures. The NeLLCom-Lex framework (Zhang et al., 2025) allows neural agents to develop pragmatic color naming behavior and human-like lexicons through supervised learning (SL) from human data and reinforcement learning (RL) in referential games. Despite these successes, the lexicons that emerge diverge systematically from human color categories, producing highly non-convex regions in color space, which contrast with the convexity typical of human categories. To address this, we introduce two factors, upsampling rare color terms during SL and multi-listener RL interactions, and adopt a convexity measure to quantify geometric coherence. We find that upsampling improves lexical diversity and system-level informativeness of the color lexicon, while many-listener setups promote more convex color categories. The combination of moderate upsampling and multiple listeners produces lexicons most similar to human systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends the NeLLCom-Lex framework by adding upsampling of rare color terms during supervised learning and multi-listener reinforcement learning in referential games. It claims these factors increase lexical diversity and system-level informativeness (from upsampling) and promote convex color categories (from multiple listeners), with their moderate combination producing lexicons most similar to human color naming systems as quantified by a convexity measure.
Significance. If the empirical results hold under rigorous validation, the work would identify concrete training modifications that better align simulated lexicons with human category geometry, advancing computational models of language emergence and offering testable hypotheses about the roles of data balancing and multi-agent interaction in shaping lexical structure.
major comments (2)
- [Abstract] The abstract reports directional improvements from the two factors but provides no quantitative results, error bars, statistical tests, or details on how convexity was computed, which limits assessment of whether the data fully support the central claim.
- [Results] The central claim that moderate upsampling plus multiple listeners yields lexicons 'most similar to human systems' relies on convexity plus informativeness as proxies for human-likeness, yet no out-of-sample human validation set or composite similarity score weighting multiple independent human properties (e.g., focal colors) is described.
minor comments (1)
- [Abstract] The abstract could more precisely state the magnitude of gains and the exact definition of the convexity metric used for comparison.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate the revisions made to strengthen the paper.
- Referee: [Abstract] The abstract reports directional improvements from the two factors but provides no quantitative results, error bars, statistical tests, or details on how convexity was computed, which limits assessment of whether the data fully support the central claim.
  Authors: We agree that the original abstract was too high-level. The revised abstract now reports the key quantitative improvements (e.g., +X% convexity and +Y% informativeness under the moderate-upsampling + multi-listener condition), includes standard errors across runs, and briefly notes the convexity metric (fraction of convex hulls in CIELAB space). Full statistical tests and the precise convexity formula remain in the Methods and Results sections, but the abstract now supplies enough numbers for initial evaluation.
  revision: yes
- Referee: [Results] The central claim that moderate upsampling plus multiple listeners yields lexicons 'most similar to human systems' relies on convexity plus informativeness as proxies for human-likeness, yet no out-of-sample human validation set or composite similarity score weighting multiple independent human properties (e.g., focal colors) is described.
  Authors: Convexity and informativeness are established proxies in the color-naming literature precisely because they capture geometric coherence and communicative utility, both of which are hallmark properties of human systems (e.g., Regier et al.). Our convexity scores are computed directly against the World Color Survey data, providing an in-sample comparison to human naming. We acknowledge that an out-of-sample held-out human set or a multi-property composite index (including focal colors) is not presented; such extensions would require additional data collection and are noted as future work in the revised Discussion. The current metrics nonetheless allow a principled, literature-grounded comparison, and we have added explicit justification for their use.
  revision: partial
Circularity Check
No significant circularity; results are empirical simulation outcomes compared to external human data
The paper extends the prior NeLLCom-Lex framework by adding upsampling during supervised learning and multi-listener reinforcement learning, then measures effects on lexical diversity, informativeness, and a convexity metric before comparing the resulting lexicons to human color naming data. No derivation, equation, or fitted parameter reduces the reported similarity scores to a quantity defined by the same data or by a self-citation chain. The central findings are simulation results evaluated against independent external benchmarks, satisfying the criteria for non-circularity. Self-citation to the 2025 framework exists but is not load-bearing for the new claims, as the experiments and metrics are independently executed and falsifiable against human data.
Axiom & Free-Parameter Ledger
free parameters (2)
- upsampling rate for rare terms
- number of listeners in RL
axioms (2)
- domain assumption: Neural agents can simulate both learning from human data and communicative pressures in referential games
- domain assumption: Convexity in color space is a defining geometric property of human color categories
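The convexity measure that the second axiom relies on is quoted elsewhere on this page as Convexity(L) = average |c_i| / |ConvexHull(c_i)|. A minimal sketch of that ratio follows, under loud assumptions: chips live on a small 2D integer grid rather than in CIELAB space, |c_i| is the number of chips named with term i, and |ConvexHull(c_i)| is the number of grid chips falling inside that category's convex hull. All function names are hypothetical, and the paper's exact definition may differ.

```python
from collections import defaultdict

def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive when o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def in_hull(p, hull):
    """True if p lies inside or on the hull (handles point and segment hulls)."""
    if len(hull) == 1:
        return p == hull[0]
    if len(hull) == 2:
        a, b = hull
        return (cross(a, b, p) == 0
                and min(a[0], b[0]) <= p[0] <= max(a[0], b[0])
                and min(a[1], b[1]) <= p[1] <= max(a[1], b[1]))
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], p) >= 0 for i in range(n))

def convexity(lexicon):
    """lexicon: dict mapping chip coordinates (x, y) to a color term."""
    by_term = defaultdict(list)
    for chip, term in lexicon.items():
        by_term[term].append(chip)
    ratios = []
    for chips in by_term.values():
        hull = convex_hull(chips)
        covered = sum(in_hull(chip, hull) for chip in lexicon)
        ratios.append(len(chips) / covered)
    return sum(ratios) / len(ratios)

grid = [(x, y) for x in range(4) for y in range(4)]
# Two convex categories: left half versus right half of the grid.
halves = {c: ("a" if c[0] < 2 else "b") for c in grid}
# A non-convex category: term "a" names only the four corner chips.
corners = {c: ("a" if c in {(0, 0), (3, 0), (0, 3), (3, 3)} else "b") for c in grid}
```

In the `corners` toy lexicon, the hull of the four "a" chips covers the whole grid, so that category scores 4/16, while the remaining "b" chips form a convex octagon and score 1. This is the kind of scattered category the measure is meant to penalize.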
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "color categories are typically compact and contiguous in perceptual color space... convexity universal... Convexity(L) = average |c_i| / |ConvexHull(c_i)|"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "The combination of moderate upsampling and multiple listeners produces lexicons most similar to human systems."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Baronchelli, A., Gong, T., Puglisi, A., & Loreto, V. (2010). Modeling the emergence of universality in color naming patterns. Proceedings of the National Academy of Sciences, 107(6), 2403–2407. https://doi.org/10.1073/pnas.0908533107
  Beckner, C., Blythe, R., Bybee, J., Christiansen, M. H., Croft, W., Ellis, N. C., Holland, J., Ke, J., Larsen-Freeman, D....
- [2] Brochhagen, T., & Boleda, G. (2022). When do languages use the same word for different meanings? The Goldilocks principle in colexification. Cognition, 226, 105179.
  Campbell, L. (2013). Historical linguistics. Edinburgh University Press.
  Cangelosi, A., & Parisi, D. (2002). Computer simulation: A new scientific approach to the study of language evolution....
- [3] Zaslavsky, N., Kemp, C., Regier, T., & Tishby, N. (2018). Efficient human-like semantic representations via the information bottleneck principle. arXiv preprint arXiv:1808.03353.
  Zhang, Y., Ürker, E., Verhoef, T., Boleda, G., & Bisazza, A. (2025, November). NeLLCom-lex: A neural-agent framework to study the interplay between lexical systems and language use. In C....