The Frequency Confound in Language-Model Surprisal and Metaphor Novelty
Pith reviewed 2026-05-20 22:55 UTC · model grok-4.3
The pith
Word frequency predicts metaphor novelty judgments more strongly than language model surprisal.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across eight Pythia model sizes and 154 training checkpoints, word frequency measures prove stronger predictors of metaphor novelty ratings than surprisal estimates. The surprisal-novelty association reaches its peak at an early training stage before declining, which coincides with a corresponding strengthening of the surprisal-frequency association at the same stage.
What carries the argument
Correlation comparison of surprisal versus two lexical frequency measures as predictors of metaphor novelty ratings, tracked across model scales and training checkpoints.
If this is right
- Reported optimal surprisal settings for modeling metaphor novelty may reflect frequency confounds rather than contextual predictability.
- Lexical frequency may serve as the primary underlying factor in associations between surprisal and processing difficulty.
- Surprisal from later training stages adds little predictive value for novelty once frequency is accounted for.
- Studies relying on surprisal for cognitive modeling of metaphors should include frequency controls to isolate true contextual effects.
Where Pith is reading between the lines
- Future work could test whether simpler frequency-only models match or exceed surprisal performance on novelty prediction tasks.
- Similar frequency confounds may affect surprisal correlations in other domains such as sentence acceptability or reading time studies.
- Reanalyzing prior surprisal papers on linguistic novelty with frequency controls could clarify which effects are genuinely contextual.
Load-bearing premise
The metaphor novelty ratings dataset reflects human judgments without residual influence from lexical frequency, and the two frequency measures capture the full confound.
What would settle it
Statistically partialling frequency out of surprisal and finding that the remaining surprisal-novelty correlation stays significant would falsify the central claim.
Figures
read the original abstract
Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal--novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal--frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the potential frequency confound in using language-model surprisal as a predictor of metaphor novelty ratings. It analyzes surprisal estimates from eight Pythia model sizes across 154 training checkpoints, paired with two distinct word frequency measures, and reports that frequency is a stronger predictor of novelty judgments than surprisal across settings. It further finds that the surprisal-novelty association peaks early in training and subsequently declines, in parallel with a timed increase in the surprisal-frequency association, suggesting that lexical frequency rather than contextual predictability may drive prior findings on metaphor processing.
Significance. If the central claims survive appropriate statistical controls for shared variance, the work would be significant for computational psycholinguistics and NLP. It challenges the interpretation of LM surprisal as a direct proxy for predictability in metaphor novelty and processing difficulty studies, while leveraging a large set of model checkpoints and dual frequency measures to strengthen the empirical case. This could prompt re-examination of surprisal-based explanations in related domains.
major comments (2)
- [Results] Results section: the claim that word frequency is a 'stronger predictor' of metaphor novelty than surprisal rests on separate associations (Pearson correlations or univariate regressions) rather than a joint model. Because surprisal and frequency are known to correlate, this does not establish unique explanatory power; a multiple regression or partial-correlation analysis controlling for their shared variance is required to support the conclusion.
- [Training stages] Training-stage analysis (likely §4 or equivalent): the reported peak-then-decline pattern in the surprisal-novelty association, and its mirroring of the surprisal-frequency rise, needs to be re-evaluated after partialling out the other variable. Without such controls, the timing alignment could be an artifact of the underlying correlation rather than an independent developmental trajectory.
minor comments (2)
- [Methods] Clarify the exact statistical tests and any data exclusion criteria used for the 154 checkpoints and novelty ratings in the methods section.
- [Figures] Ensure all figures plotting correlations across training stages include confidence intervals or significance markers for the key associations.
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our analyses regarding the frequency confound in LM surprisal for metaphor novelty. Below, we address each major comment point by point.
read point-by-point responses
-
Referee: [Results] Results section: the claim that word frequency is a 'stronger predictor' of metaphor novelty than surprisal rests on separate associations (Pearson correlations or univariate regressions) rather than a joint model. Because surprisal and frequency are known to correlate, this does not establish unique explanatory power; a multiple regression or partial-correlation analysis controlling for their shared variance is required to support the conclusion.
Authors: We agree that separate correlations do not fully establish unique explanatory power given the known correlation between surprisal and frequency. To address this, we have conducted additional multiple regression analyses where metaphor novelty is regressed on both surprisal and frequency simultaneously, as well as partial correlations. In the revised manuscript, we report that frequency remains a significant predictor even after controlling for surprisal, whereas surprisal's unique contribution is weaker or non-significant across most model sizes and checkpoints. This supports our original claim while providing a more rigorous test of unique variance. revision: yes
-
Referee: [Training stages] Training-stage analysis (likely §4 or equivalent): the reported peak-then-decline pattern in the surprisal-novelty association, and its mirroring of the surprisal-frequency rise, needs to be re-evaluated after partialling out the other variable. Without such controls, the timing alignment could be an artifact of the underlying correlation rather than an independent developmental trajectory.
Authors: We acknowledge the importance of controlling for the other variable in the training-stage analyses to rule out artifacts from their correlation. We have re-analyzed the data using partial correlations: specifically, the partial correlation between surprisal and novelty controlling for frequency at each checkpoint, and vice versa where relevant. The results show that the early peak and subsequent decline in the surprisal-novelty association persists after partialling out frequency, although the magnitude is reduced. Similarly, the increase in surprisal-frequency association over training remains evident. We have updated the relevant figures and text in the revision to include these controlled analyses, which reinforce the interpretation that lexical frequency plays a key role. revision: yes
Circularity Check
No circularity: empirical correlation study with independent inputs
full rationale
The paper performs an empirical analysis of pre-existing metaphor novelty ratings against surprisal values computed from public Pythia checkpoints and two external frequency measures. No derivation chain, fitted parameters, or predictions are defined in terms of the target associations; the reported Pearson correlations and training-stage trends are computed directly from these independent data sources. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the load-bearing claims. The central findings rest on observable data patterns rather than reducing to the paper's own equations or prior author work by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Human metaphor novelty ratings provide a reliable ground-truth measure of processing difficulty or novelty that can be compared directly to model-derived quantities.
Reference graph
Works this paper leans on
-
[1]
What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length
Tjuatja, Lindia and Neubig, Graham and Linzen, Tal and Hao, Sophie. What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1...
-
[2]
Word Frequency and Predictability Dissociate in Naturalistic Reading , author =. Open Mind , year =
-
[3]
The Career of Metaphor , author =. Psychological Review , year =
-
[4]
Reimann, Sebastian and Scheffler, Tatjana. When is a Metaphor Actually Novel? Annotating Metaphor Novelty in the Context of Automatic Metaphor Detection. Proceedings of the 18th Linguistic Annotation Workshop (LAW-XVIII). 2024
work page 2024
- [5]
-
[6]
Scientific American , volume =
The Origin of Speech , author =. Scientific American , volume =. 1960 , month = sep, doi =
work page 1960
-
[7]
Procedia Computer Science , volume =
A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans , author =. Procedia Computer Science , volume =. 2025 , doi =
work page 2025
-
[8]
Goodkind, Adam and Bicknell, Klinton. Predictive power of word surprisal for reading times is a linear function of language model quality. Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics ( CMCL 2018). 2018. doi:10.18653/v1/W18-0102
-
[9]
Oh, Byung-Doh and Yue, Shisen and Schuler, William. Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal ' s Fit to Reading Times. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.eacl-long.162
-
[10]
Oh, Byung-Doh and Schuler, William. Transformer-Based Language Model Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.128
-
[11]
Momen, Omar and Sitter, Emilie and Herrmann, Berenike and Zarrie , Sina. Surprisal and Metaphor Novelty Judgments: Moderate Correlations and Divergent Scaling Effects Revealed by Corpus-Based and Synthetic Datasets. Proceedings of the 19th Conference of the E uropean Chapter of the A ssociation for C omputational L inguistics (Volume 1: Long Papers). 2026...
-
[12]
Oh, Byung-Doh and Schuler, William. Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.202
-
[13]
How to Compute the Probability of a Word
Pimentel, Tiago and Meister, Clara. How to Compute the Probability of a Word. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1020
-
[14]
The Pile: An 800GB Dataset of Diverse Text for Language Modeling , author=. 2020 , eprint=
work page 2020
-
[15]
Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling , author=. 2023 , eprint=
work page 2023
-
[16]
Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation
Kiritchenko, Svetlana and Mohammad, Saif. Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2017. doi:10.18653/v1/P17-2074
-
[17]
Weeding out Conventionalized Metaphors: A Corpus of Novel Metaphor Annotations
Do Dinh, Erik-L \^a n and Wieland, Hannah and Gurevych, Iryna. Weeding out Conventionalized Metaphors: A Corpus of Novel Metaphor Annotations. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1171
-
[18]
Proceedings of the 43rd Annual Meeting of the Cognitive Science Society , year =
Episodic Memory Demands Modulate Novel Metaphor Use during Event Narration , author =. Proceedings of the 43rd Annual Meeting of the Cognitive Science Society , year =
-
[19]
Ethan Gotlieb Wilcox and Michael Y. Hu and Aaron Mueller and Alex Warstadt and Leshem Choshen and Chengxu Zhuang and Adina Williams and Ryan Cotterell and Tal Linzen , keywords =. Bigger is not always better: The importance of human-scale language modeling for psycholinguistics , journal =. 2025 , issn =. doi:https://doi.org/10.1016/j.jml.2025.104650 , url =
-
[20]
Proceedings of the Royal Society of London , volume =
Pearson, Karl , title =. Proceedings of the Royal Society of London , volume =. 1895 , doi =
-
[21]
The American Journal of Psychology , volume =
Spearman, Charles , title =. The American Journal of Psychology , volume =. 1904 , doi =
work page 1904
- [22]
-
[23]
Hanley, James A. and McNeil, Barbara J. , title =. Radiology , volume =. 1982 , doi =
work page 1982
-
[24]
Pattern Recognition Letters , volume =
Fawcett, Tom , title =. Pattern Recognition Letters , volume =. 2006 , doi =
work page 2006
- [25]
- [26]
-
[27]
Causal Estimation of Tokenisation Bias
Lesci, Pietro and Meister, Clara and Hofmann, Thomas and Vlachos, Andreas and Pimentel, Tiago. Causal Estimation of Tokenisation Bias. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1374
-
[28]
Gibbs, Jr, Raymond W. , year=. Embodiment and Cognitive Science , publisher=
-
[29]
Arzouan, Yossi and Goldstein, Abraham and Faust, Miriam , title =. Brain Research , year =
-
[30]
Kövecses, Zoltán , title =. 2002 , month =. doi:10.1093/oso/9780195145113.001.0001 , url =
-
[31]
Journal of Experimental Psychology: Learning, Memory, and Cognition , year =
Effects of Familiarity and Aptness on Metaphor Processing , author =. Journal of Experimental Psychology: Learning, Memory, and Cognition , year =
-
[32]
From Novel to Familiar: Tuning the Brain for Metaphors , author =. NeuroImage , year =
-
[33]
Large Language Model Displays Emergent Ability to Interpret Novel Literary Metaphors , author=. 2024 , eprint=
work page 2024
-
[34]
Jey Han Lau and Alexander Clark and Shalom Lappin , title =. Cognitive Science , year =
-
[35]
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , month =
Revisiting the Uniform Information Density Hypothesis , author =. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , month =. 2021 , address =. doi:10.18653/v1/2021.emnlp-main.74 , pages =
-
[36]
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , month =
A Systematic Assessment of Syntactic Generalization in Neural Language Models , author =. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , month =. 2020 , address =. doi:10.18653/v1/2020.acl-main.158 , pages =
-
[37]
Expectation-based syntactic comprehension , author=. Cognition , volume=. 2008 , publisher=
work page 2008
-
[38]
Proceedings of the National Academy of Sciences , year =
Cory Shain and Clara Meister and Tiago Pimentel and Ryan Cotterell and Roger Levy , title =. Proceedings of the National Academy of Sciences , volume =. 2024 , doi =. https://www.pnas.org/doi/pdf/10.1073/pnas.2307876121 , abstract =
-
[39]
Ethan Gotlieb Wilcox and Jon Gauthier and Jennifer Hu and Peng Qian and Roger Levy , title =. CoRR , volume =. 2020 , url =. 2006.01912 , timestamp =
-
[40]
Journal of Psycholinguistic Research , volume =
Making the Unseen Seen: The Role of Signaling and Novelty in Rating Metaphors , author =. Journal of Psycholinguistic Research , volume =. 2024 , doi =
work page 2024
-
[41]
Roncero, C. and de Almeida, R. G. , title =. Language and Cognition , volume =. 2014 , doi =
work page 2014
-
[42]
Cardillo, E. R. and Watson, C. E. and Schmidt, G. L. and Kranjec, A. and Chatterjee, A. , title =. Frontiers in Psychology , volume =. 2012 , doi =
work page 2012
-
[43]
Cardillo, E. R. and Schmidt, G. L. and Kranjec, A. and Chatterjee, A. , title =. Behavior Research Methods , volume =. 2010 , doi =
work page 2010
-
[44]
Introducing the LCC Metaphor Datasets
Mohler, Michael and Brunson, Mary and Rink, Bryan and Tomlinson, Marc. Introducing the LCC Metaphor Datasets. Proceedings of the Tenth International Conference on Language Resources and Evaluation ( LREC '16). 2016
work page 2016
-
[45]
Gudrun and Burgers, Christian and Krennmayr, Tina and Steen, Gerard J
Reijnierse, W. Gudrun and Burgers, Christian and Krennmayr, Tina and Steen, Gerard J. , title =. Corpora , volume =. 2019 , doi =. https://doi.org/10.3366/cor.2019.0176 , abstract =
-
[46]
On the Role of Context in Reading Time Prediction
Opedal, Andreas and Chodroff, Eleanor and Cotterell, Ryan and Wilcox, Ethan. On the Role of Context in Reading Time Prediction. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.179
-
[47]
Gill Philip , title =. The Routledge Handbook of Metaphor and Language , editor =. 2016 , doi = "10.4324/9781315672953", note =
-
[48]
Metaphorical Polysemy Detection: Conventional Metaphor Meets Word Sense Disambiguation
Maudslay, Rowan Hall and Teufel, Simone. Metaphorical Polysemy Detection: Conventional Metaphor Meets Word Sense Disambiguation. Proceedings of the 29th International Conference on Computational Linguistics. 2022
work page 2022
-
[49]
Scaling in Cognitive Modelling: a Multilingual Approach to Human Reading Times
de Varda, Andrea and Marelli, Marco. Scaling in Cognitive Modelling: a Multilingual Approach to Human Reading Times. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10.18653/v1/2023.acl-short.14
- [50]
-
[51]
Oh, Byung-Doh and Schuler, William. Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00548
-
[52]
The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage
Oh, Byung-Doh and Zhu, Hongao and Schuler, William. The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.91
- [53]
-
[54]
Hale, John , title =. Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies , pages =. 2001 , publisher =. doi:10.3115/1073336.1073357 , abstract =
-
[55]
arXiv preprint arXiv:2303.13988 , year=
Machine psychology , author=. arXiv preprint arXiv:2303.13988 , year=
-
[56]
Proceedings of the National Academy of Sciences , volume=
Using cognitive psychology to understand GPT-3 , author=. Proceedings of the National Academy of Sciences , volume=. 2023 , publisher=
work page 2023
-
[57]
On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior , author=. 2020 , eprint=
work page 2020
-
[58]
Proceedings of the National Academy of Sciences , volume =
Jennifer Hu and Kyle Mahowald and Gary Lupyan and Anna Ivanova and Roger Levy , title =. Proceedings of the National Academy of Sciences , volume =. 2024 , doi =
work page 2024
-
[59]
Psychological Bulletin , volume =
Eye movements in reading and information processing: 20 years of research , author =. Psychological Bulletin , volume =
-
[60]
Nathaniel J. Smith and Roger Levy , keywords =. The effect of word predictability on reading time is logarithmic , journal =. 2013 , issn =. doi:https://doi.org/10.1016/j.cognition.2013.02.013 , url =
-
[61]
Frontiers in Psychology , volume =
A study on surprisal and semantic relatedness for eye-tracking data prediction , author =. Frontiers in Psychology , volume =. 2023 , pages =. doi:10.3389/fpsyg.2023.1112365 , url =
-
[62]
Eye Tracking Based Cognitive Evaluation of Automatic Readability Assessment Measures , author=. 2025 , eprint=
work page 2025
-
[63]
Claude E. Shannon , title =. Bell System Technical Journal , volume =. 1948 , note =
work page 1948
- [64]
-
[65]
MIP: A Method for Identifying Metaphorically Used Words in Discourse , journal =. 2007 , volume =
work page 2007
-
[66]
Comprehending conventional and novel metaphors: An ERP study , journal =
Vicky Tzuyin Lai and Tim Curran and Lise Menn , keywords =. Comprehending conventional and novel metaphors: An ERP study , journal =. 2009 , issn =. doi:https://doi.org/10.1016/j.brainres.2009.05.088 , url =
-
[67]
Warner, Benjamin and Chaffin, Antoine and Clavi. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.127
-
[68]
DeBERTa: Decoding-enhanced BERT with Disentangled Attention , author=. 2021 , eprint=
work page 2021
- [69]
- [70]
-
[71]
Language Models are Unsupervised Multitask Learners , author =. 2019 , journal =
work page 2019
-
[72]
G.J. Steen and A.G. Dorst and J.B. Herrmann and A.A. Kaal and T. Krennmayr and T. Pasma. A method for linguistic metaphor identification. From MIP to MIPVU. 2010
work page 2010
- [73]
-
[74]
Publications Manual , year = "1983", publisher =
work page 1983
-
[75]
Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243
- [76]
-
[77]
Dan Gusfield , title =. 1997
work page 1997
-
[78]
Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =
work page 2015
-
[79]
A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =
Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =
-
[80]
Dong, Chuanming and Gambette, Philippe and Dominguès, Catherine , month = oct, year =. Extracting. doi:10.5220/0010656700003064 , abstract =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.