arxiv: 2605.03510 · v1 · submitted 2026-05-05 · 💻 cs.CL


Rational Communication Shapes Morphological Composition

Fengyuan Yang , Yongqian Peng , Yuxi Ma , Chenheng Xu , Yixin Zhu


Pith reviewed 2026-05-07 16:39 UTC · model grok-4.3


keywords: morphological composition · rational speech act · communicative efficiency · compounding · derivation · historical English · pragmatics · lexicalization


The pith

A trade-off between listener recoverability and speaker production cost predicts attested morphological compositions over historical alternatives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why English selects particular morpheme combinations for compounds and derivations rather than other sequences that could have been formed from morphemes available at the same time. It frames this choice as a historically situated decision within the Rational Speech Act framework, where speakers weigh how informative a form will be to listeners against the cost of producing it. By constructing time-indexed lexicons from COHA and COCA, the authors test the model on 4323 attested forms spanning 1820 to 2019 and generate plausible unattested alternatives for each. Integrated models that combine semantic informativeness with production cost rank the attested forms higher than semantic-only or cost-only baselines, and this advantage widens as more alternatives are considered. The results indicate that meaning alone underdetermines morphological choice and that communicative efficiency helps explain which forms lexicalize.

Core claim

Within the Rational Speech Act framework using a time-indexed lexicon constructed from COHA and COCA, across 4323 naturally occurring English compounds and derivations spanning 1820–2019, attested compositions are systematically ranked above unattested alternatives generated from contemporaneously available morphemes. Models integrating semantic informativeness with production cost outperform semantic-only and cost-only baselines on Mean Reciprocal Rank (MRR) and top-k accuracy (Acc@k), with the advantage of the Pragmatic Speaker model (S1) over the semantic-only baseline growing as the candidate set expands, where meaning alone leaves morphological choice underdetermined.
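For readers unfamiliar with the two metrics named above, the following is a minimal sketch of how MRR and Acc@k are conventionally computed from ranked candidate lists. The function names and the toy morpheme sequences are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], gold: str) -> float:
    """Return 1/rank of the gold form (1-indexed), or 0.0 if it is absent."""
    for position, candidate in enumerate(ranked, start=1):
        if candidate == gold:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings, golds) -> float:
    """Average the reciprocal rank of the gold form over all items."""
    return sum(reciprocal_rank(r, g) for r, g in zip(rankings, golds)) / len(golds)


def accuracy_at_k(rankings, golds, k: int) -> float:
    """Fraction of items whose gold form appears among the top k candidates."""
    hits = sum(1 for r, g in zip(rankings, golds) if g in r[:k])
    return hits / len(golds)


# Hypothetical model output: each list is sorted from best to worst candidate.
rankings = [
    ["sun+flower", "sun+plant", "light+flower"],  # gold at rank 1
    ["teach+ist", "teach+er", "teach+or"],        # gold at rank 2
    ["dark+ity", "dark+hood", "dark+ness"],       # gold at rank 3
]
golds = ["sun+flower", "teach+er", "dark+ness"]

mrr = mean_reciprocal_rank(rankings, golds)   # (1 + 1/2 + 1/3) / 3
acc1 = accuracy_at_k(rankings, golds, 1)      # 1 of 3 items
acc2 = accuracy_at_k(rankings, golds, 2)      # 2 of 3 items
```

MRR rewards placing the attested form near the top of the list, while Acc@k only asks whether it falls within the first k candidates, which is why the two can diverge as candidate sets grow.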

What carries the argument

The Pragmatic Speaker model (S1) within the RSA framework, which computes form choice by integrating listener recoverability from semantics with speaker production cost.
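As a rough illustration of this trade-off, the sketch below implements a generic RSA-style speaker of the form P(u | m) ∝ exp(α (λ · log P_L(m | u) − cost(u))), which is the form quoted in the simulated rebuttal further down this page and is not confirmed here against the paper itself. The toy lexicon, the meaning labels, and the cost values are all hypothetical.

```python
import math


def literal_listener(lexicon, utterance):
    """P_L(m | u): normalize the semantic fit of `utterance` over meanings."""
    scores = lexicon[utterance]
    total = sum(scores.values())
    return {meaning: s / total for meaning, s in scores.items()}


def pragmatic_speaker(lexicon, cost, meaning, alpha=1.0, lam=1.0):
    """P(u | m) proportional to exp(alpha * (lam * log P_L(m | u) - cost(u)))."""
    utilities = {}
    for u in lexicon:
        p = literal_listener(lexicon, u).get(meaning, 0.0)
        if p > 0.0:  # forms that cannot convey the meaning get zero mass
            utilities[u] = alpha * (lam * math.log(p) - cost[u])
    top = max(utilities.values())  # subtract the max for numerical stability
    weights = {u: math.exp(v - top) for u, v in utilities.items()}
    total = sum(weights.values())
    return {u: w / total for u, w in weights.items()}


# Hypothetical semantic fit of each candidate form to two meanings.
lexicon = {
    "sun+flower": {"SUNFLOWER": 0.8, "DAISY": 0.2},
    "tall+yellow+flower": {"SUNFLOWER": 0.9, "DAISY": 0.1},
    "flower": {"SUNFLOWER": 0.5, "DAISY": 0.5},
}
# Hypothetical production costs (here simply the number of morphemes).
cost = {"sun+flower": 2.0, "tall+yellow+flower": 3.0, "flower": 1.0}
zero_cost = {u: 0.0 for u in lexicon}

integrated = pragmatic_speaker(lexicon, cost, "SUNFLOWER", lam=5.0)
semantic_only = pragmatic_speaker(lexicon, zero_cost, "SUNFLOWER", lam=5.0)

best_integrated = max(integrated, key=integrated.get)
best_semantic = max(semantic_only, key=semantic_only.get)
```

In this toy setting the semantic-only variant prefers the longest, most explicit form, whereas adding cost shifts the preference to the shorter compound. That mirrors the qualitative pattern the page describes, but the numbers carry no empirical weight.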

If this is right

Attested compositions rank systematically higher than unattested alternatives when both semantics and cost are modeled together.
The performance gap between the pragmatic speaker and semantic-only models widens as the number of candidate compositions increases.
Lexicalization reflects a communicative trade-off between expressiveness and efficiency.
Rational accounts of communication extend from utterance-level choices to the internal structure of words.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the pattern holds, the same cost-recoverability balance could be used to forecast which candidate neologisms are likelier to enter active use in current English.
The approach might generalize to test whether similar efficiency pressures shape affixation and compounding patterns in languages with richer morphology than English.
It opens the possibility of modeling syntactic construction choice with the same pragmatic machinery once time-indexed construction inventories are available.
As the stock of available morphemes grows over centuries, the underdetermination left by meaning alone may increase the relative influence of production cost on new word formation.

Load-bearing premise

That a trade-off between listener recoverability and speaker production cost can predict attested compositions over contemporaneously available alternatives, and that the time-indexed lexicon from COHA and COCA accurately represents the morphemes available to speakers at each historical period.

What would settle it

A set of attested compounds or derivations from a given historical window where the pragmatic speaker model assigns lower rank or probability to the attested forms than to some unattested alternatives built from morphemes present in the same time-indexed lexicon.

Figures

Figures reproduced from arXiv: 2605.03510 by Chenheng Xu, Fengyuan Yang, Yixin Zhu, Yongqian Peng, Yuxi Ma.

**Figure 1.** Modeling morphological composition as rational communication. A pragmatic speaker (left) constructs a word by selecting among candidate morpheme combinations available in the lexicon, balancing semantic informativeness, that is, how well a literal listener (right) can infer the intended meaning from competing alternatives, against production cost. The example shows alternative compositions for the concept “a program…

**Figure 2.** Held-out Acc@k as a function of 𝑘. The discriminative model consistently identifies the gold morpheme sequence most accurately. The advantage of 𝑆1 over the semantic-only baseline grows with 𝑘, consistent with pragmatic reasoning becoming most useful when many semantically plausible alternatives compete.

**Figure 3.** Training and validation MRR diagnostic. The same broad ordering appears during fitting: models with access to semantic information outperform the cost-only baseline, and models that combine both sources of information perform best among the interpretable variants. Error bars indicate standard deviation across repeated training runs.

**Figure 4.** Mean morphological length of top-𝑘 predictions vs. 𝑘. The semantic baseline consistently generates the longest candidates; 𝑆1 variants start shorter but grow steadily with 𝑘, eventually surpassing the cost-only model, reflecting a progressive relaxation of economy constraints to maintain semantic adequacy. Discriminative and cost-only models remain relatively stable.

**Figure 5.** Temporal robustness across cumulative training windows. Ranking performance as a function of the last year included in the cumulative training window. MRR (left) and Acc@10 (right) show stable model ordering across windows; shaded bands denote ±1 standard deviation across folds.

Original abstract

Human languages expand vocabularies by combining existing morphemes rather than inventing arbitrary forms. Communicative efficiency shapes lexical systems at multiple levels (Gibson et al., 2019), yet morphological composition -- combining morphemes through compounding or affixation -- has rarely been modeled as a historically situated speaker choice among competing morpheme sequences, leaving unanswered why a language settles on one morpheme combination over other plausible alternatives. We ask whether a trade-off between listener recoverability and speaker production cost can predict attested compositions over contemporaneously available alternatives. Here we show, within the Rational Speech Act (RSA) framework (Frank & Goodman, 2012; Goodman & Frank, 2016) using a time-indexed lexicon constructed from Corpus of Historical American English (COHA) and Corpus of Contemporary American English (COCA), that across 4323 naturally occurring English compounds and derivations spanning 1820--2019, attested compositions are systematically ranked above unattested alternatives generated from contemporaneously available morphemes. Models integrating semantic informativeness with production cost outperform semantic-only and cost-only baselines on Mean Reciprocal Rank (MRR) and top-k accuracy (Acc@k), with the advantage of the Pragmatic Speaker model ($S_1$) over the semantic-only baseline growing as the candidate set expands, where meaning alone leaves morphological choice underdetermined. These findings suggest that lexicalization reflects a communicative trade-off between expressiveness and efficiency, extending rational accounts of communication from utterance-level choice to the internal structure of words.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies RSA to historical morphological choices and finds pragmatic models predict attested forms better, though methodological transparency and corpus coverage are concerns.


The main things to know are that the authors extend the Rational Speech Act model to morphological composition using historical corpora, and that their pragmatic speaker model outperforms baselines in ranking attested English compounds and derivations above alternatives from the same time period. This is new because prior RSA work has focused mainly on syntax and word choice rather than on why one morphological form wins over others historically.

They use time-indexed lexicons from COHA and COCA to create 4323 test cases from 1820 to 2019. For each, they generate unattested options from the morphemes available at that time, and show that combining semantic informativeness with production cost improves the ranking on MRR and Acc@k. The benefit of the full model grows with larger candidate sets, which fits the claim that meaning alone underdetermines the choice.

The paper does well in laying out a testable version of the communicative efficiency idea at the morpheme level and getting results that match the prediction. The soft spots are around the implementation. The abstract and available description do not detail how the candidate sets are built, how cost is calculated, or whether the weights were tuned without using the evaluation data. This makes the circularity concern real. The possibility that the lexicon undercounts available morphemes because of corpus limits could also affect the results: if speakers had more options than the corpus shows, the comparison changes. If the full paper has checks for this, it would strengthen things, but it looks like an area that needs more work.

This is for linguists and computational researchers interested in pragmatics and morphology. Someone studying rational accounts of language would get something out of the extension and the data. It deserves peer review to sort out the methods and see if the findings hold up under scrutiny. I recommend sending it for review.

Referee Report

3 major / 3 minor

Summary. The manuscript claims that a trade-off between listener recoverability (semantic informativeness) and speaker production cost, formalized in a Rational Speech Act pragmatic speaker model (S1), predicts attested English compounds and derivations better than semantic-only or cost-only baselines. Across 4323 naturally occurring forms from 1820–2019, using a time-indexed lexicon extracted from COHA and COCA, attested compositions are ranked above unattested alternatives generated from contemporaneously available morphemes; the pragmatic advantage on MRR and Acc@k grows as the candidate set expands.

Significance. If the central results hold after methodological clarification, the work would meaningfully extend rational accounts of communication from utterance choice to the internal structure of words, offering diachronic quantitative evidence that communicative efficiency influences morphological composition. The scale of the analysis (4323 items spanning nearly two centuries) and the explicit comparison to baselines are strengths. The finding that meaning alone leaves morphological choice underdetermined when candidate sets are large is potentially important for theories of lexicalization.

major comments (3)

[Methods (Lexicon Construction)] Methods section on lexicon construction and candidate generation: The time-indexed lexicon is built by extracting forms from COHA/COCA up to the attestation year, with any morpheme not appearing treated as unavailable. This procedure is vulnerable to corpus sparsity for rare or emerging affixes, genre bias, and tokenization choices; if plausible low-frequency morphemes are systematically excluded, the candidate sets become artificially constrained and the superior ranking of attested forms may be an artifact rather than evidence for the pragmatic trade-off. A sensitivity check against a fuller morphological resource or explicit discussion of coverage limitations is needed.
[Model Specification] Model specification and parameter estimation: The abstract states that models integrating semantic informativeness with production cost outperform baselines, yet provides no details on the exact definition of the cost function, the functional form of the pragmatic speaker (S1), or how the two free parameters (production cost weight, semantic informativeness weight) are estimated. If these weights are optimized against the same 4323-item dataset used for MRR/Acc@k evaluation, the reported advantage risks circularity; explicit statements on held-out estimation, cross-validation, or pre-specified parameter values are required for the claim to be load-bearing.
[Results] Results and evaluation procedure: The claim that the S1 advantage over the semantic-only baseline grows as the candidate set expands is central, but the manuscript supplies no table or figure breaking down MRR and Acc@k by candidate-set size, no statistical significance tests on the differences, and no error analysis of cases where the model fails. Without these, it is difficult to assess whether the pragmatic component genuinely resolves underdetermination or whether the pattern is driven by a small number of high-frequency items.

minor comments (3)

[Abstract] Define all acronyms (MRR, Acc@k, RSA, COHA, COCA) on first use in the abstract and main text.
[Model Specification] The notation for the pragmatic speaker model (S1) and the precise equations for the listener and speaker distributions should be presented explicitly rather than assumed from prior RSA literature.
[Methods] Add a brief discussion of how the historical periods are sliced and whether any smoothing or interpolation is applied to the time-indexed lexicon.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for clarification and strengthening. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.


Referee: Methods section on lexicon construction and candidate generation: The time-indexed lexicon is built by extracting forms from COHA/COCA up to the attestation year, with any morpheme not appearing treated as unavailable. This procedure is vulnerable to corpus sparsity for rare or emerging affixes, genre bias, and tokenization choices; if plausible low-frequency morphemes are systematically excluded, the candidate sets become artificially constrained and the superior ranking of attested forms may be an artifact rather than evidence for the pragmatic trade-off. A sensitivity check against a fuller morphological resource or explicit discussion of coverage limitations is needed.

Authors: We agree that reliance on COHA/COCA introduces risks from sparsity and bias for low-frequency or emerging morphemes. In the revision, we will perform a sensitivity analysis by augmenting the time-indexed lexicon with a supplementary morphological resource (such as MorphoLex or a curated affix inventory from linguistic databases) and re-run the primary analyses to confirm robustness. We will also add a dedicated subsection discussing coverage limitations, tokenization effects, and potential impacts on candidate generation.

revision: yes

Referee: Model specification and parameter estimation: The abstract states that models integrating semantic informativeness with production cost outperform baselines, yet provides no details on the exact definition of the cost function, the functional form of the pragmatic speaker (S1), or how the two free parameters (production cost weight, semantic informativeness weight) are estimated. If these weights are optimized against the same 4323-item dataset used for MRR/Acc@k evaluation, the reported advantage risks circularity; explicit statements on held-out estimation, cross-validation, or pre-specified parameter values are required for the claim to be load-bearing.

Authors: The full Methods section defines the cost function as negative log unigram frequency of the morpheme sequence and the S1 speaker as the standard RSA form P(u | m) ∝ exp(α (λ · log P_L(m | u) − cost(u))), with λ and α as the two weights. To address potential circularity, parameters were selected via grid search on a random 20% held-out partition, with all reported MRR/Acc@k computed on the remaining 80%. We will expand the manuscript with the precise equations, grid ranges, and held-out procedure, plus a statement confirming no test-set leakage.

revision: yes

Referee: Results and evaluation procedure: The claim that the S1 advantage over the semantic-only baseline grows as the candidate set expands is central, but the manuscript supplies no table or figure breaking down MRR and Acc@k by candidate-set size, no statistical significance tests on the differences, and no error analysis of cases where the model fails. Without these, it is difficult to assess whether the pragmatic component genuinely resolves underdetermination or whether the pattern is driven by a small number of high-frequency items.

Authors: We accept that the current presentation lacks the granularity needed to evaluate the key claim. The revision will add a new figure and supplementary table reporting MRR and Acc@k stratified by candidate-set size bins, accompanied by paired statistical tests (Wilcoxon signed-rank with Bonferroni correction) on model differences within each bin. We will also include an error-analysis subsection that categorizes model failures by item frequency, semantic recoverability, and other factors to show that the pragmatic advantage is not confined to high-frequency cases.

revision: yes
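The held-out procedure described in the second response (grid search on a random 20% partition, reporting on the remaining 80%) can be sketched as follows. This is only an illustration of the split discipline under stated assumptions: `split_items`, `grid_search`, and the placeholder `toy_score` are hypothetical names, and the grid values are invented rather than taken from the paper.

```python
import itertools
import random


def split_items(items, tune_fraction=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut into tuning and evaluation parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_tune = int(round(tune_fraction * len(shuffled)))
    return shuffled[:n_tune], shuffled[n_tune:]


def grid_search(tune_items, score_fn, alphas, lams):
    """Return the (alpha, lam) pair with the best score on tune_items only."""
    return max(
        itertools.product(alphas, lams),
        key=lambda pair: score_fn(tune_items, alpha=pair[0], lam=pair[1]),
    )


def toy_score(items, alpha, lam):
    """Placeholder objective standing in for MRR; it peaks at alpha=2.0, lam=1.5."""
    return -((alpha - 2.0) ** 2 + (lam - 1.5) ** 2)


item_ids = range(100)  # stand-in for the attested forms
tune, evaluation = split_items(item_ids)

# Parameters are chosen without ever touching the evaluation partition.
best_alpha, best_lam = grid_search(
    tune, toy_score, alphas=[0.5, 1.0, 2.0], lams=[0.5, 1.5, 3.0]
)
```

The point of the sketch is that the evaluation partition is never passed to `grid_search`, which is the property a no-leakage statement would need to guarantee.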

Circularity Check

0 steps flagged

No circularity: derivation applies external RSA framework to independently constructed lexicon


The paper constructs a time-indexed lexicon from COHA/COCA corpora and applies the RSA framework (cited to Frank & Goodman 2012 and Goodman & Frank 2016) to rank attested compounds/derivations above generated alternatives. Evaluation uses MRR and Acc@k to compare an integrated pragmatic speaker model against semantic-only and cost-only baselines. No quoted equations or descriptions in the abstract or reader's summary show parameters fitted on the evaluation set, self-definition of key quantities (e.g., informativeness defined via the ranking itself), or load-bearing self-citations that reduce the central claim to prior author work. The result is an empirical comparison on natural historical data rather than a definitional or fitted equivalence, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the RSA utility function and the assumption that corpus-derived morpheme inventories match historical speaker knowledge; both introduce parameters whose values are not independently justified in the abstract.

free parameters (2)

production cost weight
Scalar balancing speaker effort against listener utility in the pragmatic speaker model S1.
semantic informativeness weight
Scalar weighting how much meaning a composition conveys to the listener.

axioms (2)

domain assumption: Speakers and listeners are rational agents who maximize expected utility in communication
Core premise of the Rational Speech Act framework invoked to model morphological choice.
domain assumption: The set of morphemes available at any historical time is exactly those appearing in the COHA/COCA corpus up to that time
Used to generate the pool of unattested alternatives for each attested composition.
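To make the second assumption concrete, here is a minimal sketch of a time-indexed lexicon and a naive candidate generator. The first-attestation years are invented, and exhaustive enumeration of ordered morpheme sequences is an assumption made for illustration; the page does not describe how the paper actually builds its candidate sets.

```python
from itertools import permutations


def lexicon_at(first_seen, year):
    """Morphemes whose first corpus attestation is on or before `year`."""
    return {morpheme for morpheme, seen in first_seen.items() if seen <= year}


def candidates(first_seen, year, max_len=2):
    """All ordered sequences of 1..max_len distinct available morphemes."""
    available = sorted(lexicon_at(first_seen, year))
    forms = []
    for n in range(1, max_len + 1):
        forms.extend("+".join(seq) for seq in permutations(available, n))
    return forms


# Hypothetical first-attestation years; not taken from COHA or COCA.
first_seen = {"web": 1820, "site": 1820, "page": 1820, "blog": 1999}

pool_1990 = candidates(first_seen, 1990)  # "blog" is not yet available
pool_2005 = candidates(first_seen, 2005)  # "blog" has entered the lexicon
```

Under this assumption a morpheme missing from the corpus before a given year simply cannot appear in any competitor, which is the coverage risk raised in the referee report.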




Reference graph

Works this paper leans on

26 extracted references · 2 canonical work pages · 2 internal anchors

[1] Affixes.org. (2008). Affixes.org: Dictionary of English affixes. (Cit. on p. 3)

[2] Algeo, J. (1980). Where do all the new words come from? American Speech, 55(4), 264–277 (cit. on p. 1)
Arndt-Lappe, S. (2015). Word-formation and analogy. In Word-formation: An international handbook of the languages of Europe (pp. 822–841). De Gruyter Mouton. (Cit. on p. 2)

[3] Bauer, L. (2001). Morphological productivity (Vol. 95). Cambridge University Press. (Cit. on pp. 1, 2)

[4] Brinton, L. J., & Traugott, E. C. (2005). Lexicalization and language change. Cambridge University Press. (Cit. on pp. 1, 3)

[5] Chandra, K., Chen, T., Li, T.-M., Ragan-Kelley, J., & Tenenbaum, J. (2024). Cooperative explanation as rational communication. Annual Meeting of the Cognitive Science Society (CogSci) (cit. on p. 2)

[6] Chen, G., Van Deemter, K., & Lin, C. (2018). Modelling pro-drop with the rational speech acts model. International Conference on Natural Language Generation (cit. on p. 2)

[7] Davies, M. (2010). The Corpus of Contemporary American English as the first reliable monitor corpus of English. Literary and Linguistic Computing, 25(4), 447–464 (cit. on p. 3)

[8] Davies, M. (2012). The 400 million word Corpus of Historical American English (1810–2009). In English historical linguistics 2010 (pp. 231–262). John Benjamins Publishing Company. (Cit. on p. 3)
Frank, M. C., & Goodman, N. D. (2012). Predicting pragmatic reasoning in language games. Science, 336(6084), 998 (cit. on pp. 1, 2)

[9] Gibson, E., Futrell, R., Piantadosi, S. P., Dautriche, I., Mahowald, K., Bergen, L., & Levy, R. (2019). How efficiency shapes human language. Trends in Cognitive Sciences, 23(5), 389–407 (cit. on pp. 1, 2)

[10] Goodman, N. D., & Frank, M. C. (2016). Pragmatic language interpretation as probabilistic inference. Trends in Cognitive Sciences, 20(11), 818–829 (cit. on pp. 1, 2)

[11] Jaeger, T. F. (2010). Redundancy and reduction: Speakers manage syntactic information density. Cognitive Psychology, 61(1), 23–62 (cit. on pp. 1, 3)
Jiang, G., Hofer, M., Mao, J., Wong, L., Tenenbaum, J., & Levy, R. (2024). Finding structure in logographic writing with library learning. Annual Meeting of the Cognitive Science Society (CogSci) (cit. on p. 1)

[12] Levy, R. (2025). Finding structure in logographic writing with library learning ii: Grapheme, sound, and meaning systematicity. Annual Meeting of the Cognitive Science Society (CogSci) (cit. on p. 1)

[13] Levin, B., Glass, L., & Jurafsky, D. (2019). Systematicity in the semantics of noun compounds: The role of artifacts vs. natural kinds. Linguistics, 57(3), 429–471 (cit. on p. 1)

[14] Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106(3), 1126–1177 (cit. on pp. 1, 3)
Ma, Y., Peng, Y., & Zhu, Y. (2025). Word embeddings track social group changes across 70 dates in China. Annual Meeting of the Cognitive Science Society (CogSci) (cit. on p. 6)

[15] Marelli, M., & Baroni, M. (2015). Affixation in semantic space: Modeling morpheme meanings with compositional distributional semantics. Psychological Review, 122(3), 485–515 (cit. on p. 1)

[16] Mattiello, E., & Dressler, W. U. (2018). The morphosemantic transparency/opacity of novel English analogical compounds and compound families. Studia Anglica Posnaniensia, 53(1), 67–114 (cit. on p. 1)
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (cit. on p. 3)

[17] Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41 (cit. on pp. 2, 3)

[18] Peng, Y., Ma, Y., Wang, M., Wang, Y., Wang, Y., Zhang, C., Zhu, Y., & Zheng, Z. (2025). Probing and inducing combinational creativity in vision-language models. Annual Meeting of the Cognitive Science Society (CogSci) (cit. on p. 1)

[19] Piantadosi, S. T., Tily, H., & Gibson, E. (2011). Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences (PNAS), 108(9), 3526–3529 (cit. on p. 1)

[20] Plag, I. (1999). Morphological productivity: Structural constraints in English derivation (Vol. 28). Walter de Gruyter. (Cit. on pp. 1, 2)

[21] Reddy, S., McCarthy, D., & Manandhar, S. (2011). An empirical study on compositionality in compound nouns. International Joint Conference on Natural Language Processing (cit. on p. 1)

[22] Salge, C., Ay, N., Polani, D., & Prokopenko, M. (2015). Zipf’s law: Balancing signal usage cost and communication efficiency. PLOS One, 10(10), e0139475 (cit. on p. 1)

[23] Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423 (cit. on p. 1)
Štekauer, P., & Lieber, R. (Eds.). (2005). Handbook of word-formation (Vol. 64). Springer Science & Business Media. (Cit. on p. 1)
TheWelcomer. (2025). MorphSeg [GitHub repository]. https://github.com/TheWelcomer/MorphSeg (cit. on p. 3)

[24] Xu, A., Kemp, C., Frermann, L., & Xu, Y. (2024). Word reuse and combination support efficient communication of emerging concepts. Proceedings of the National Academy of Sciences (PNAS), 121(46), e2406971121 (cit. on pp. 1, 2, 5)

[25] Zaslavsky, N., Kemp, C., Regier, T., & Tishby, N. (2018). Efficient compression in color naming and its evolution. Proceedings of the National Academy of Sciences (PNAS), 115(31), 7937–7942 (cit. on pp. 2, 5)
Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., et al. (2025). Qwen3 embedding: Advancing text embedding and reranking through f...

[26] Zipf, G. K. (1949). Human behavior and the principle of least effort. Addison-Wesley Press. (Cit. on p. 1)