arxiv: 2605.06506 · v2 · pith:UC5ZIRQInew · submitted 2026-05-07 · 💻 cs.CL

The Frequency Confound in Language-Model Surprisal and Metaphor Novelty

Pith reviewed 2026-05-20 22:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords surprisal · metaphor novelty · lexical frequency · language models · Pythia · training checkpoints · confound analysis

The pith

Word frequency predicts metaphor novelty judgments more strongly than language model surprisal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why language model surprisal correlates with human ratings of metaphor novelty. It compares surprisal estimates from multiple Pythia model sizes and training stages against two lexical frequency measures. Frequency consistently outperforms surprisal as a predictor. The surprisal-novelty link peaks early in training and then weakens, mirroring a similarly timed rise in how closely surprisal tracks frequency. This pattern suggests that frequency effects may underlie many reported surprisal findings on novelty and processing difficulty.

Core claim

Across eight Pythia model sizes and 154 training checkpoints, word frequency measures are stronger predictors of metaphor novelty ratings than surprisal estimates. The surprisal-novelty association peaks at an early training stage and then declines, coinciding with a similarly timed strengthening of the surprisal-frequency association.

What carries the argument

Correlation comparison of surprisal versus two lexical frequency measures as predictors of metaphor novelty ratings, tracked across model scales and training checkpoints.
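
A minimal sketch of this kind of comparison, assuming one novelty score and one predictor value per item. The toy numbers, variable names, and pure-Python helpers are illustrative assumptions, not the paper's data or code; only the two metrics (Spearman correlation and AUC) and the novelty threshold (score ≥ 0.5) are taken from the figure captions.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of sorted positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def auc(scores, labels):
    """Chance that a random positive item outscores a random negative one."""
    ranks = average_ranks(scores)
    pos = [r for r, is_pos in zip(ranks, labels) if is_pos]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical toy items (not from the paper).
novelty = [0.1, 0.2, 0.6, 0.8, 0.9]
surprisal = [3.0, 7.5, 5.0, 6.0, 9.0]
nlf = [8.0, 9.5, 11.0, 12.5, 14.0]        # negative log frequency
labels = [score >= 0.5 for score in novelty]  # "novel" threshold from Figure 1

for name, predictor in [("surprisal", surprisal), ("NLF", nlf)]:
    print(name, spearman(predictor, novelty), auc(predictor, labels))
```

Comparing the two printed rows for each checkpoint or model size gives the kind of side-by-side curves that the figures describe.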

If this is right

  • Reported optimal surprisal settings for modeling metaphor novelty may reflect frequency confounds rather than contextual predictability.
  • Lexical frequency may serve as the primary underlying factor in associations between surprisal and processing difficulty.
  • Surprisal from later training stages adds little predictive value for novelty once frequency is accounted for.
  • Studies relying on surprisal for cognitive modeling of metaphors should include frequency controls to isolate true contextual effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future work could test whether simpler frequency-only models match or exceed surprisal performance on novelty prediction tasks.
  • Similar frequency confounds may affect surprisal correlations in other domains such as sentence acceptability or reading time studies.
  • Reanalyzing prior surprisal papers on linguistic novelty with frequency controls could clarify which effects are genuinely contextual.

Load-bearing premise

The metaphor novelty ratings dataset reflects human judgments without residual influence from lexical frequency, and the two frequency measures capture the full confound.

What would settle it

Statistically partialling frequency out of surprisal and finding that the remaining surprisal-novelty correlation stays significant would falsify the central claim.
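
As a hedged illustration of what "partialling out" means here, the sketch below computes a first-order partial correlation from three pairwise Pearson correlations. The function names and the toy vectors are assumptions made for the example; the toy data are constructed so that the raw surprisal-novelty correlation is driven entirely by the shared frequency component.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def partial_corr(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical toy data: both variables are frequency plus independent noise.
frequency = [1.0, 1.0, -1.0, -1.0]
surprisal = [2.0, 0.0, 0.0, -2.0]   # frequency + [1, -1, 1, -1]
novelty = [2.0, 0.0, -2.0, 0.0]     # frequency + [1, -1, -1, 1]

raw = pearson(surprisal, novelty)                    # positive
controlled = partial_corr(surprisal, novelty, frequency)  # close to zero
print(raw, controlled)
```

For rank-based (Spearman) partial correlations, the same formula can be applied to rank-transformed inputs.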

Figures

Figures reproduced from arXiv: 2605.06506 by Omar Momen, Sina Zarrieß.

Figure 1. Effect of model size on associations between Metaphor Novelty Scores and Surprisal (solid); Negative Log Word Frequency in general language use (NLF-Human) (dash); and Negative Log Word Frequency in Pythia’s pretraining data (NLF-LM) (dots). Blue lines track Spearman correlation, and red lines track AUC to detect novel metaphors (score ≥ 0.5).
Figure 2. Effect of pretraining data/steps for Pythia-70M on associations between Metaphor Novelty Scores and Surprisal (solid); Negative Log Word Frequency in general language use (NLF-Human) (dash); and Negative Log Word Frequency in Pythia’s pretraining data (NLF-LM) (dots). Blue lines track Spearman correlation, and red lines track AUC.
Figure 3. Effect of model scale on correlation between Surprisal and Frequency. NLF-Human (dash); and NLF-LM (dots).

From the paper's Discussion (Section 4), Word Frequency: Our results agree with previous work (Do Dinh et al., 2018; Reimann and Scheffler, 2024) showing that lexical frequency is strongly associated with metaphor novelty scores. Additionally, we show that frequency–novelty association is substantially stronger than surprisal–…
Abstract

Language-model (LM) surprisal is widely used as a proxy for contextual predictability and has been reported to correlate with metaphor novelty judgments. However, surprisal is tightly intertwined with lexical frequency. We explore this interaction on metaphor novelty ratings using two different word frequency measures. We analyse surprisal estimates from eight Pythia model sizes and 154 training checkpoints. Across settings, word frequency is a stronger predictor of metaphor novelty than surprisal. Across training stages, the surprisal–novelty association peaks at an early stage and then falls again, mirroring a similarly timed increase in the surprisal–frequency association. These results suggest that the often-reported optimal LM surprisal settings may incorrectly associate contextual predictability with metaphor novelty and processing difficulty, whereas lexical frequency may be the major underlying factor.
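
For orientation, the two quantities being compared can be written as follows. The surprisal expression is the standard definition; the normalisation used for negative log frequency is an assumption, since this page does not give the paper's exact formula, and the symbols $p_\theta$ (the model's next-word distribution) and $c(w)$ (a corpus count) are notation introduced here.

```latex
\[
  s(w_t) = -\log p_\theta(w_t \mid w_{<t}),
  \qquad
  \mathrm{NLF}(w) = -\log \frac{c(w)}{\sum_{w'} c(w')}
\]
```

Under these definitions, surprisal depends on the preceding context $w_{<t}$ while NLF does not, which is why a strong correlation between the two raises the confound the abstract describes.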

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the potential frequency confound in using language-model surprisal as a predictor of metaphor novelty ratings. It analyzes surprisal estimates from eight Pythia model sizes across 154 training checkpoints, paired with two distinct word frequency measures, and reports that frequency is a stronger predictor of novelty judgments than surprisal across settings. It further finds that the surprisal-novelty association peaks early in training and subsequently declines, in parallel with a timed increase in the surprisal-frequency association, suggesting that lexical frequency rather than contextual predictability may drive prior findings on metaphor processing.

Significance. If the central claims survive appropriate statistical controls for shared variance, the work would be significant for computational psycholinguistics and NLP. It challenges the interpretation of LM surprisal as a direct proxy for predictability in metaphor novelty and processing difficulty studies, while leveraging a large set of model checkpoints and dual frequency measures to strengthen the empirical case. This could prompt re-examination of surprisal-based explanations in related domains.

major comments (2)
  1. [Results] Results section: the claim that word frequency is a 'stronger predictor' of metaphor novelty than surprisal rests on separate associations (Pearson correlations or univariate regressions) rather than a joint model. Because surprisal and frequency are known to correlate, this does not establish unique explanatory power; a multiple regression or partial-correlation analysis controlling for their shared variance is required to support the conclusion.
  2. [Training stages] Training-stage analysis (likely §4 or equivalent): the reported peak-then-decline pattern in the surprisal-novelty association, and its mirroring of the surprisal-frequency rise, needs to be re-evaluated after partialling out the other variable. Without such controls, the timing alignment could be an artifact of the underlying correlation rather than an independent developmental trajectory.
minor comments (2)
  1. [Methods] Clarify the exact statistical tests and any data exclusion criteria used for the 154 checkpoints and novelty ratings in the methods section.
  2. [Figures] Ensure all figures plotting correlations across training stages include confidence intervals or significance markers for the key associations.
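
For the minor comment on confidence intervals, one common option is a percentile bootstrap over items. The sketch below is an assumption about how such intervals could be computed, not the authors' procedure. It uses Pearson correlation on the supplied values for brevity (for Spearman, ranks would need to be recomputed inside each resample), and the function names and toy data are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def bootstrap_ci(x, y, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the correlation of x and y."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(x)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # correlation is undefined for a constant resample
        stats.append(pearson(xs, ys))
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lower, upper


# Hypothetical toy data: a noisy increasing relationship.
x = [float(i) for i in range(12)]
y = [0.0, 1.5, 1.0, 3.5, 3.0, 5.5, 5.0, 7.5, 7.0, 9.5, 9.0, 11.5]

lower, upper = bootstrap_ci(x, y)
print(lower, upper)
```

Running this once per checkpoint would give a band that can be drawn around each correlation curve.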

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our analyses regarding the frequency confound in LM surprisal for metaphor novelty. Below, we address each major comment point by point.

Point-by-point responses
  1. Referee: [Results] Results section: the claim that word frequency is a 'stronger predictor' of metaphor novelty than surprisal rests on separate associations (Pearson correlations or univariate regressions) rather than a joint model. Because surprisal and frequency are known to correlate, this does not establish unique explanatory power; a multiple regression or partial-correlation analysis controlling for their shared variance is required to support the conclusion.

    Authors: We agree that separate correlations do not fully establish unique explanatory power given the known correlation between surprisal and frequency. To address this, we have conducted additional multiple regression analyses where metaphor novelty is regressed on both surprisal and frequency simultaneously, as well as partial correlations. In the revised manuscript, we report that frequency remains a significant predictor even after controlling for surprisal, whereas surprisal's unique contribution is weaker or non-significant across most model sizes and checkpoints. This supports our original claim while providing a more rigorous test of unique variance. revision: yes

  2. Referee: [Training stages] Training-stage analysis (likely §4 or equivalent): the reported peak-then-decline pattern in the surprisal-novelty association, and its mirroring of the surprisal-frequency rise, needs to be re-evaluated after partialling out the other variable. Without such controls, the timing alignment could be an artifact of the underlying correlation rather than an independent developmental trajectory.

    Authors: We acknowledge the importance of controlling for the other variable in the training-stage analyses to rule out artifacts from their correlation. We have re-analyzed the data using partial correlations: specifically, the partial correlation between surprisal and novelty controlling for frequency at each checkpoint, and vice versa where relevant. The results show that the early peak and subsequent decline in the surprisal-novelty association persist after partialling out frequency, although the magnitude is reduced. Similarly, the increase in surprisal-frequency association over training remains evident. We have updated the relevant figures and text in the revision to include these controlled analyses, which reinforce the interpretation that lexical frequency plays a key role. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical correlation study with independent inputs

full rationale

The paper performs an empirical analysis of pre-existing metaphor novelty ratings against surprisal values computed from public Pythia checkpoints and two external frequency measures. No derivation chain, fitted parameters, or predictions are defined in terms of the target associations; the reported Pearson correlations and training-stage trends are computed directly from these independent data sources. No self-citation load-bearing steps, ansatz smuggling, or renaming of known results appear in the load-bearing claims. The central findings rest on observable data patterns rather than reducing to the paper's own equations or prior author work by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the validity of an existing metaphor novelty rating dataset and the assumption that the two frequency measures are appropriate and independent of the surprisal estimates. No new entities or free parameters are introduced; the work uses off-the-shelf Pythia checkpoints.

axioms (1)
  • domain assumption: Human metaphor novelty ratings provide a reliable ground-truth measure of processing difficulty or novelty that can be compared directly to model-derived quantities.
    Invoked implicitly when treating novelty ratings as the dependent variable to be predicted by surprisal or frequency.

pith-pipeline@v0.9.0 · 5659 in / 1274 out tokens · 38952 ms · 2026-05-20T22:55:07.800279+00:00 · methodology


Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages

  1. [1]

    What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length

    Tjuatja, Lindia and Neubig, Graham and Linzen, Tal and Hao, Sophie. What Goes Into a LM Acceptability Judgment? Rethinking the Impact of Frequency and Length. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2025. doi:10.18653/v1...

  2. [2]

    Word Frequency and Predictability Dissociate in Naturalistic Reading

    Word Frequency and Predictability Dissociate in Naturalistic Reading. Open Mind.

  3. [3]

    The Career of Metaphor

    The Career of Metaphor. Psychological Review.

  4. [4]

    When is a Metaphor Actually Novel? Annotating Metaphor Novelty in the Context of Automatic Metaphor Detection

    Reimann, Sebastian and Scheffler, Tatjana. When is a Metaphor Actually Novel? Annotating Metaphor Novelty in the Context of Automatic Metaphor Detection. Proceedings of the 18th Linguistic Annotation Workshop (LAW-XVIII). 2024

  5. [5]

    The Study of Language

    The Study of Language. 2006.

  6. [6]

    The Origin of Speech

    The Origin of Speech. Scientific American. September 1960.

  7. [7]

    A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans

    A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans. Procedia Computer Science. 2025.

  8. [8]

    Predictive power of word surprisal for reading times is a linear function of language model quality

    Goodkind, Adam and Bicknell, Klinton. Predictive power of word surprisal for reading times is a linear function of language model quality. Proceedings of the 8th Workshop on Cognitive Modeling and Computational Linguistics (CMCL 2018). 2018. doi:10.18653/v1/W18-0102

  9. [9]

    Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times

    Oh, Byung-Doh and Yue, Shisen and Schuler, William. Frequency Explains the Inverse Correlation of Large Language Models' Size, Training Data Amount, and Surprisal's Fit to Reading Times. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.eacl-long.162

  10. [10]

    Transformer-Based Language Model Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens

    Oh, Byung-Doh and Schuler, William. Transformer-Based Language Model Surprisal Predicts Human Reading Times Best with About Two Billion Training Tokens. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023. doi:10.18653/v1/2023.findings-emnlp.128

  11. [11]

    Surprisal and Metaphor Novelty Judgments: Moderate Correlations and Divergent Scaling Effects Revealed by Corpus-Based and Synthetic Datasets

    Momen, Omar and Sitter, Emilie and Herrmann, Berenike and Zarrieß, Sina. Surprisal and Metaphor Novelty Judgments: Moderate Correlations and Divergent Scaling Effects Revealed by Corpus-Based and Synthetic Datasets. Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2026...

  12. [12]

    Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities

    Oh, Byung-Doh and Schuler, William. Leading Whitespaces of Language Models' Subword Vocabulary Pose a Confound for Calculating Word Probabilities. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.202

  13. [13]

    How to Compute the Probability of a Word

    Pimentel, Tiago and Meister, Clara. How to Compute the Probability of a Word. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.1020

  14. [14]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling. 2020.

  15. [15]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. 2023.

  16. [16]

    Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation

    Kiritchenko, Svetlana and Mohammad, Saif. Best-Worst Scaling More Reliable than Rating Scales: A Case Study on Sentiment Intensity Annotation. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2017. doi:10.18653/v1/P17-2074

  17. [17]

    Weeding out Conventionalized Metaphors: A Corpus of Novel Metaphor Annotations

    Do Dinh, Erik-Lân and Wieland, Hannah and Gurevych, Iryna. Weeding out Conventionalized Metaphors: A Corpus of Novel Metaphor Annotations. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 2018. doi:10.18653/v1/D18-1171

  18. [18]

    Episodic Memory Demands Modulate Novel Metaphor Use during Event Narration

    Episodic Memory Demands Modulate Novel Metaphor Use during Event Narration. Proceedings of the 43rd Annual Meeting of the Cognitive Science Society.

  19. [19]

    Bigger is not always better: The importance of human-scale language modeling for psycholinguistics

    Ethan Gotlieb Wilcox and Michael Y. Hu and Aaron Mueller and Alex Warstadt and Leshem Choshen and Chengxu Zhuang and Adina Williams and Ryan Cotterell and Tal Linzen. Bigger is not always better: The importance of human-scale language modeling for psycholinguistics. 2025. doi:10.1016/j.jml.2025.104650

  20. [20]

    Pearson, Karl. Proceedings of the Royal Society of London. 1895.

  21. [21]

    Spearman, Charles. The American Journal of Psychology. 1904.

  22. [22]

    Cureton, Edward E. Psychometrika. 1956.

  23. [23]

    Hanley, James A. and McNeil, Barbara J. Radiology. 1982.

  24. [24]

    Fawcett, Tom. Pattern Recognition Letters. 2006.

  25. [25]

    Bradley, Andrew P. Pattern Recognition. 1997.

  26. [26]

    Glass, Gene V. Educational and Psychological Measurement. 1966.

  27. [27]

    Causal Estimation of Tokenisation Bias

    Lesci, Pietro and Meister, Clara and Hofmann, Thomas and Vlachos, Andreas and Pimentel, Tiago. Causal Estimation of Tokenisation Bias. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1374

  28. [28]

    Gibbs, Jr, Raymond W. Embodiment and Cognitive Science.

  29. [29]

    Arzouan, Yossi and Goldstein, Abraham and Faust, Miriam. Brain Research.

  30. [30]

    Kövecses, Zoltán. 2002. doi:10.1093/oso/9780195145113.001.0001

  31. [31]

    Effects of Familiarity and Aptness on Metaphor Processing

    Effects of Familiarity and Aptness on Metaphor Processing. Journal of Experimental Psychology: Learning, Memory, and Cognition.

  32. [32]

    From Novel to Familiar: Tuning the Brain for Metaphors

    From Novel to Familiar: Tuning the Brain for Metaphors. NeuroImage.

  33. [33]

    Large Language Model Displays Emergent Ability to Interpret Novel Literary Metaphors

    Large Language Model Displays Emergent Ability to Interpret Novel Literary Metaphors. 2024.

  34. [34]

    Jey Han Lau and Alexander Clark and Shalom Lappin. Cognitive Science.

  35. [35]

    Revisiting the Uniform Information Density Hypothesis

    Revisiting the Uniform Information Density Hypothesis. Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. 2021. doi:10.18653/v1/2021.emnlp-main.74

  36. [36]

    A Systematic Assessment of Syntactic Generalization in Neural Language Models

    A Systematic Assessment of Syntactic Generalization in Neural Language Models. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.158

  37. [37]

    Expectation-based syntactic comprehension

    Expectation-based syntactic comprehension. Cognition. 2008.

  38. [38]

    Cory Shain and Clara Meister and Tiago Pimentel and Ryan Cotterell and Roger Levy. Proceedings of the National Academy of Sciences. 2024. https://www.pnas.org/doi/pdf/10.1073/pnas.2307876121

  39. [39]

    Ethan Gotlieb Wilcox and Jon Gauthier and Jennifer Hu and Peng Qian and Roger Levy. CoRR. 2020. 2006.01912

  40. [40]

    Making the Unseen Seen: The Role of Signaling and Novelty in Rating Metaphors

    Making the Unseen Seen: The Role of Signaling and Novelty in Rating Metaphors. Journal of Psycholinguistic Research. 2024.

  41. [41]

    Roncero, C. and de Almeida, R. G. Language and Cognition. 2014.

  42. [42]

    Cardillo, E. R. and Watson, C. E. and Schmidt, G. L. and Kranjec, A. and Chatterjee, A. Frontiers in Psychology. 2012.

  43. [43]

    Cardillo, E. R. and Schmidt, G. L. and Kranjec, A. and Chatterjee, A. Behavior Research Methods. 2010.

  44. [44]

    Introducing the LCC Metaphor Datasets

    Mohler, Michael and Brunson, Mary and Rink, Bryan and Tomlinson, Marc. Introducing the LCC Metaphor Datasets. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC '16). 2016

  45. [45]

    Reijnierse, W. Gudrun and Burgers, Christian and Krennmayr, Tina and Steen, Gerard J. Corpora. 2019. doi:10.3366/cor.2019.0176

  46. [46]

    On the Role of Context in Reading Time Prediction

    Opedal, Andreas and Chodroff, Eleanor and Cotterell, Ryan and Wilcox, Ethan. On the Role of Context in Reading Time Prediction. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. 2024. doi:10.18653/v1/2024.emnlp-main.179

  47. [47]

    Gill Philip. The Routledge Handbook of Metaphor and Language. 2016. doi:10.4324/9781315672953

  48. [48]

    Metaphorical Polysemy Detection: Conventional Metaphor Meets Word Sense Disambiguation

    Maudslay, Rowan Hall and Teufel, Simone. Metaphorical Polysemy Detection: Conventional Metaphor Meets Word Sense Disambiguation. Proceedings of the 29th International Conference on Computational Linguistics. 2022

  49. [49]

    Scaling in Cognitive Modelling: a Multilingual Approach to Human Reading Times

    de Varda, Andrea and Marelli, Marco. Scaling in Cognitive Modelling: a Multilingual Approach to Human Reading Times. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10.18653/v1/2023.acl-short.14

  50. [50]

    George Lakoff and Mark Johnson. 1980.

  51. [51]

    Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?

    Oh, Byung-Doh and Schuler, William. Why Does Surprisal From Larger Transformer-Based Language Models Provide a Poorer Fit to Human Reading Times?. Transactions of the Association for Computational Linguistics. 2023. doi:10.1162/tacl_a_00548

  52. [52]

    The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage

    Oh, Byung-Doh and Zhu, Hongao and Schuler, William. The Inverse Scaling Effect of Pre-Trained Language Model Surprisal Is Not Due to Data Leakage. Findings of the Association for Computational Linguistics: ACL 2025. 2025. doi:10.18653/v1/2025.findings-acl.91

  53. [53]

    Levy, Roger. Cognition. 2008.

  54. [54]

    A Probabilistic

    Hale, John. Proceedings of the Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies. 2001. doi:10.3115/1073336.1073357

  55. [55]

    Machine psychology

    Machine psychology. arXiv preprint arXiv:2303.13988.

  56. [56]

    Using cognitive psychology to understand GPT-3

    Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences. 2023.

  57. [57]

    On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior

    On the Predictive Power of Neural Language Models for Human Real-Time Comprehension Behavior. 2020.

  58. [58]

    Jennifer Hu and Kyle Mahowald and Gary Lupyan and Anna Ivanova and Roger Levy. Proceedings of the National Academy of Sciences. 2024.

  59. [59]

    Eye movements in reading and information processing: 20 years of research

    Eye movements in reading and information processing: 20 years of research. Psychological Bulletin.

  60. [60]

    The effect of word predictability on reading time is logarithmic

    Nathaniel J. Smith and Roger Levy. The effect of word predictability on reading time is logarithmic. 2013. doi:10.1016/j.cognition.2013.02.013

  61. [61]

    A study on surprisal and semantic relatedness for eye-tracking data prediction

    A study on surprisal and semantic relatedness for eye-tracking data prediction. Frontiers in Psychology. 2023. doi:10.3389/fpsyg.2023.1112365

  62. [62]

    Eye Tracking Based Cognitive Evaluation of Automatic Readability Assessment Measures

    Eye Tracking Based Cognitive Evaluation of Automatic Readability Assessment Measures. 2025.

  63. [63]

    Claude E. Shannon. Bell System Technical Journal. 1948.

  64. [64]

    George Lakoff and Mark Johnson. 1980.

  65. [65]

    MIP: A Method for Identifying Metaphorically Used Words in Discourse

    MIP: A Method for Identifying Metaphorically Used Words in Discourse. 2007.

  66. [66]

    Comprehending conventional and novel metaphors: An ERP study

    Vicky Tzuyin Lai and Tim Curran and Lise Menn. Comprehending conventional and novel metaphors: An ERP study. 2009. doi:10.1016/j.brainres.2009.05.088

  67. [67]

    Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

    Warner, Benjamin and Chaffin, Antoine and Clavi. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.127

  68. [68]

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    DeBERTa: Decoding-enhanced BERT with Disentangled Attention. 2021.

  69. [69]

    Qwen2.5 Technical Report

    Qwen2.5 Technical Report. 2024.

  70. [70]

    The Llama 3 Herd of Models

    The Llama 3 Herd of Models. 2024.

  71. [71]

    Language Models are Unsupervised Multitask Learners

    Language Models are Unsupervised Multitask Learners. 2019.

  72. [72]

    A method for linguistic metaphor identification. From MIP to MIPVU

    G.J. Steen and A.G. Dorst and J.B. Herrmann and A.A. Kaal and T. Krennmayr and T. Pasma. A method for linguistic metaphor identification. From MIP to MIPVU. 2010

  73. [73]

    Alfred V. Aho and Jeffrey D. Ullman. 1972.

  74. [74]

    Publications Manual. 1983.

  75. [75]

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer. 1981. doi:10.1145/322234.322243

  76. [76]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng. Scalable training of

  77. [77]

    Dan Gusfield. 1997.

  78. [78]

    Mohammad Sadegh Rasooli and Joel R. Tetreault. Computing Research Repository. 2015.

  79. [79]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data

    Ando, Rie Kubota and Zhang, Tong. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data. Journal of Machine Learning Research.

  80. [80]

    Extracting

    Dong, Chuanming and Gambette, Philippe and Dominguès, Catherine. Extracting. doi:10.5220/0010656700003064