Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Hidetaka Kamigaito; Makoto Morishita; Naoki Shikoda; Ryo Fujii; Taro Watanabe; Yosuke Kishinami; Yuto Nishida

arxiv: 2604.21882 · v1 · submitted 2026-04-23 · 💻 cs.CL

Yuto Nishida, Naoki Shikoda, Yosuke Kishinami, Ryo Fujii, Makoto Morishita, Hidetaka Kamigaito, Taro Watanabe

Pith reviewed 2026-05-09 21:51 UTC · model grok-4.3

keywords non-verbatim memorization · entity surface forms · large language models · factual recall · Wikipedia redirects · knowledge evaluation · RedirectQA dataset

The pith

Factual recall in large language models changes when the same entity is referred to by different names or spellings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models store facts about entities in a way that is tied to specific names or more abstractly. It builds a dataset that pairs each fact with multiple surface forms drawn from Wikipedia redirects, covering aliases, abbreviations, spelling variants, and common misspellings. Experiments across thirteen models show that accuracy often drops or rises when only the entity name changes, and that small spelling differences are handled more reliably than large lexical shifts like abbreviations. Both the overall frequency of an entity and the frequency of its particular name correlate with performance, with entity frequency adding explanatory power beyond surface frequency alone. This indicates that memorization sits between being locked to one phrasing and being fully independent of phrasing, so single-name tests may not fully reveal what models actually know.

Core claim

The authors introduce RedirectQA, an entity-based QA dataset that associates Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, prediction outcomes frequently change when only the entity surface form is altered. This inconsistency is category-dependent, with models more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses indicate that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, the ar

What carries the argument

RedirectQA dataset, which uses Wikipedia redirect categories to pair factual triples with multiple surface forms of each entity and thereby isolate the effect of name variation on recall.
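
The pairing described above can be pictured with a minimal sketch. Everything in it is hypothetical: the field names, the category labels, the template wording and the example entity are placeholders chosen for illustration, not the paper's actual schema or data.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject_id: str   # Wikidata-style entity ID (illustrative)
    relation: str     # relation label (illustrative)
    answer: str       # object of the triple, used as the gold answer

@dataclass(frozen=True)
class SurfaceForm:
    text: str
    category: str     # "canonical" or an assumed redirect category label

# Hypothetical relation-specific template; the paper's wording is not shown here.
TEMPLATES = {"capital": "What is the capital of {subject}?"}

def realize_questions(triple, surface_forms):
    """Build one question per surface form; only the entity name varies."""
    template = TEMPLATES[triple.relation]
    return [
        {
            "question": template.format(subject=form.text),
            "answer": triple.answer,
            "category": form.category,
            "subject_id": triple.subject_id,
        }
        for form in surface_forms
    ]

triple = Triple("Q30", "capital", "Washington, D.C.")
forms = [
    SurfaceForm("United States", "canonical"),
    SurfaceForm("USA", "abbreviation"),
    SurfaceForm("United States of America", "alternative_name"),
    SurfaceForm("Untied States", "misspelling"),
]
questions = realize_questions(triple, forms)
for q in questions:
    print(q["category"], "->", q["question"])
```

Because the fact and the template are held fixed, any difference in a model's answers across these questions can be attributed to the name used, at least within this simplified setup.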

If this is right

Entity QA evaluations that rely on a single canonical name per entity may give an incomplete picture of factual memorization.
Models exhibit greater consistency across minor spelling variants than across aliases or abbreviations.
Both the popularity of an entity and the popularity of its specific surface form influence recall accuracy.
Entity frequency often accounts for additional variance in accuracy after surface frequency is controlled for.
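
The consistency notion behind these points can be made concrete with a small sketch. It assumes each record states whether a model answered correctly under the canonical name and under one redirect form. The record layout, category labels and toy values are illustrative assumptions, not results from the paper.

```python
from collections import defaultdict

def consistency_by_category(records):
    """Fraction of (canonical, redirect) pairs with the same correctness, per category."""
    same = defaultdict(int)
    total = defaultdict(int)
    for category, canonical_ok, redirect_ok in records:
        total[category] += 1
        same[category] += int(canonical_ok == redirect_ok)
    return {c: same[c] / total[c] for c in total}

# Toy records: (redirect category, correct with canonical name, correct with redirect form)
records = [
    ("spelling_variant", True, True),
    ("spelling_variant", False, False),
    ("abbreviation", True, False),
    ("abbreviation", True, True),
    ("abbreviation", False, True),
    ("abbreviation", False, False),
]
rates = consistency_by_category(records)
print(rates)
```

Note that a pair counts as consistent when both answers are wrong as well as when both are right, so this measure is separate from accuracy.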

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Knowledge benchmarks should routinely include multiple surface forms per entity to measure non-verbatim memorization more reliably.
Training procedures that expose models to varied names for the same entities may reduce surface-form sensitivity in factual recall.
The same surface-form sensitivity could appear in other memorized content, such as events or relations, if tested with comparable variation.

Load-bearing premise

Wikipedia redirect categories can separate surface-form effects from other factors such as entity frequency or question difficulty without selection bias.

What would settle it

A controlled experiment that balances all surface forms for frequency and query difficulty and then checks whether the observed inconsistency across forms disappears.
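
One rough way to approximate such a balanced check is to bin items by surface frequency and compare categories only within the same bin. The sketch below assumes log10 frequency bins and a per-item flag for whether the prediction flipped between canonical and redirect forms. The bin width, record layout and toy numbers are assumptions, and query difficulty is not modeled.

```python
import math
from collections import defaultdict

def flip_rate_by_bin(items):
    """Group items by (log10 frequency bin, category) and return the flip rate per cell."""
    flips = defaultdict(int)
    counts = defaultdict(int)
    for category, surface_freq, flipped in items:
        bin_id = int(math.log10(surface_freq))  # assumes surface_freq >= 1
        key = (bin_id, category)
        counts[key] += 1
        flips[key] += int(flipped)
    return {key: flips[key] / counts[key] for key in counts}

# Toy items: (redirect category, surface frequency, prediction flipped?)
items = [
    ("abbreviation", 1200, True),
    ("abbreviation", 4800, False),
    ("spelling_variant", 2500, False),
    ("spelling_variant", 9000, False),
    ("abbreviation", 130, True),
    ("spelling_variant", 560, True),
]
rates = flip_rate_by_bin(items)
for key in sorted(rates):
    print(key, rates[key])
```

If a gap between categories persists inside each frequency bin, frequency imbalance alone becomes a less likely explanation. If the gap vanishes, the category effect may be a by-product of frequency.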

Figures

Figures reproduced from arXiv: 2604.21882 by Hidetaka Kamigaito, Makoto Morishita, Naoki Shikoda, Ryo Fujii, Taro Watanabe, Yosuke Kishinami, Yuto Nishida.

**Figure 1.** Overview of the RedirectQA construction process: (1) Factual triples are collected from Wikidata. (2) Each subject entity is associated with canonical and redirect surface forms, together with redirect categories, using Wikipedia redirects. (3) Question realizations are generated from surface instances using relation-specific question templates. …

**Figure 2.** Prediction consistency between canonical and redirect surface forms on RedirectQA using the original …

**Figure 3.** Relationship between accuracy and entity/surface frequencies for Pythia-12B. Each point shows the mean …

**Figure 4.** Prediction consistency between canonical and redirect surface forms on RedirectQA using the paraphrased …

Original abstract

Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper builds RedirectQA from Wikipedia redirects to test LLM factual recall across entity surface forms and reports category-dependent robustness plus frequency effects, but the categories may not cleanly separate surface variation from popularity confounds.

The main takeaway is that LLMs change their answers on the same fact when the entity name shifts, and this paper shows the shift is bigger for aliases and abbreviations than for minor spelling variants. They built RedirectQA by linking Wikidata triples to categorized redirects and ran it on 13 models, which is the concrete new piece here. Most prior entity QA work sticks to one canonical name, so the category split and the finding that robustness is not uniform add something useful. The frequency checks also show entity-level frequency often explains more variance than surface frequency alone, which lines up with what people already suspect about memorization but now has direct numbers attached. The experiments are broad enough to make the patterns visible without obvious cherry-picking in the abstract.

The soft spot is the dataset itself. Wikipedia redirects are not neutral; popular entities accumulate more variants, and erroneous forms tend to cluster on less common or harder-to-recall items. The paper reports frequency correlations, yet without explicit matching of entities across categories or controls for query template difficulty, the reported differences between orthographic and lexical variants could partly trace back to those imbalances rather than pure surface-form sensitivity. That does not kill the result, but it does mean the central claim about surface-form dependence rests on an assumption that needs tighter verification in the methods.

This is aimed at people who evaluate or improve factual reliability in LLMs. Anyone running entity-based tests or building knowledge-augmented systems can use the dataset or the robustness patterns to make their benchmarks stricter. It has enough scale and a clear empirical hook to deserve a serious referee, even if the review will likely focus on the controls and data splits. I would send it to review rather than desk reject.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces RedirectQA, a dataset constructed from Wikipedia redirects that associates Wikidata factual triples with categorized surface forms (aliases, abbreviations, spelling variants, erroneous forms) for each entity. It evaluates factual QA performance across 13 LLMs and reports that prediction outcomes frequently change when only the entity surface form is varied, with greater robustness to minor orthographic changes than to larger lexical variations. Frequency analyses indicate correlations with both entity-level and surface-level frequencies, where entity frequency often contributes predictive power beyond surface frequency alone. The authors conclude that factual memorization is neither purely surface-specific nor fully surface-invariant, emphasizing the value of surface-form diversity in non-verbatim memorization evaluations.

Significance. If the central empirical patterns hold after addressing potential dataset confounds, the work would meaningfully advance LLM evaluation practices by showing that single-surface-form benchmarks can misestimate factual recall. The RedirectQA construction provides a scalable, real-world-derived resource for testing surface sensitivity, and the frequency results offer a starting point for disentangling memorization from access mechanisms. This could influence benchmark design in knowledge probing and reliability assessment, though its impact depends on the robustness of the category isolation.

major comments (2)

[§3] §3 (RedirectQA Dataset Construction): The use of Wikipedia redirect categories to isolate surface-form effects risks selection bias, as redirect volume and type correlate with entity popularity (more frequent entities have richer redirect graphs) and 'erroneous' forms may cluster on harder-to-recall entities. Without explicit stratification, propensity matching, or multivariate controls for entity frequency and query difficulty (e.g., template complexity or answer rarity) across categories, the reported category-dependent robustness may reflect these imbalances rather than genuine surface-form sensitivity. The frequency analyses mentioned do not appear to include such controls.
[§5] §5 (Experiments): The claim that prediction outcomes 'often change' with surface form and that models are 'more robust' to minor variations requires per-category accuracy tables with statistical tests (e.g., paired significance or effect sizes) and full model list with hyperparameters. The abstract reports patterns across 13 models, but without data splits, query phrasing controls, or ablation on template effects, it is difficult to verify that changes are attributable to surface forms rather than confounds.

minor comments (3)

[Abstract] The abstract and early sections should explicitly list the 13 LLMs evaluated (with sizes and sources) rather than deferring to a later table, to allow immediate assessment of scope.
[§2] Related work section could more directly cite prior studies on alias handling in entity linking and surface-form robustness in QA to better position the novelty of RedirectQA.
[§5] Ensure all figures include error bars, sample sizes per category, and clear legends; some frequency correlation plots appear to lack these based on the described results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important considerations for strengthening our analysis of surface-form sensitivity in LLMs. We respond to each major comment below and commit to revisions that address the raised concerns.

Referee: [§3] §3 (RedirectQA Dataset Construction): The use of Wikipedia redirect categories to isolate surface-form effects risks selection bias, as redirect volume and type correlate with entity popularity (more frequent entities have richer redirect graphs) and 'erroneous' forms may cluster on harder-to-recall entities. Without explicit stratification, propensity matching, or multivariate controls for entity frequency and query difficulty (e.g., template complexity or answer rarity) across categories, the reported category-dependent robustness may reflect these imbalances rather than genuine surface-form sensitivity. The frequency analyses mentioned do not appear to include such controls.

Authors: We agree that selection bias is a valid concern given the nature of Wikipedia redirects. Our current frequency analyses show associations with both entity- and surface-level frequencies, with entity frequency often providing additional predictive power. However, these do not fully control for confounds such as query difficulty or stratification by popularity. In the revised version, we will add explicit stratification by entity frequency bins and include multivariate regression controls that account for entity popularity, answer rarity, and template complexity to better isolate surface-form effects. revision: yes
Referee: [§5] §5 (Experiments): The claim that prediction outcomes 'often change' with surface form and that models are 'more robust' to minor variations requires per-category accuracy tables with statistical tests (e.g., paired significance or effect sizes) and full model list with hyperparameters. The abstract reports patterns across 13 models, but without data splits, query phrasing controls, or ablation on template effects, it is difficult to verify that changes are attributable to surface forms rather than confounds.

Authors: We acknowledge the need for greater statistical detail and transparency. The manuscript presents aggregated results across 13 LLMs demonstrating category-dependent changes in predictions. To address this, we will include in the revised §5: (1) per-category accuracy tables with paired t-tests or McNemar's tests for significance and effect sizes (e.g., Cohen's d); (2) the complete list of models and their hyperparameters; (3) details on data splits and query phrasing controls; and (4) ablations examining template effects to confirm attribution to surface forms. These additions will allow verification of the robustness patterns. revision: yes
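
The paired test mentioned in this response can be sketched with the standard library alone. The following is an exact (binomial) McNemar test on paired correctness outcomes. The pair layout and toy counts are assumptions made for illustration and are not taken from the paper.

```python
from math import comb

def mcnemar_exact(pairs):
    """Two-sided exact McNemar test on (canonical_correct, redirect_correct) pairs."""
    # Only discordant pairs carry information about a difference between conditions.
    b = sum(1 for canon, redir in pairs if canon and not redir)
    c = sum(1 for canon, redir in pairs if not canon and redir)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, b follows Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

# Toy paired outcomes for one redirect category (illustrative values only).
pairs = (
    [(True, False)] * 8
    + [(False, True)] * 2
    + [(True, True)] * 10
    + [(False, False)] * 5
)
b, c, p_value = mcnemar_exact(pairs)
print(b, c, p_value)
```

The test suits this design because the same fact is asked under two names, so outcomes are paired rather than independent. Running it per category would still call for a multiple-comparison correction, which is not shown.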

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and standard QA evaluation

The paper constructs RedirectQA by associating Wikidata triples with Wikipedia redirect surface forms and runs off-the-shelf LLM QA evaluations across 13 models. No equations, fitted parameters, or derivations appear in the provided text; claims rest on direct empirical measurements of accuracy changes across surface-form categories. Frequency analyses are post-hoc correlations, not inputs that force the main results. The study is self-contained against external benchmarks (model outputs on the new dataset) with no self-citation load-bearing or self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The work rests on the assumption that redirect-based surface forms provide an unbiased way to isolate memorization effects and that standard entity QA probes factual knowledge without other confounds.

invented entities (1)

RedirectQA dataset (no independent evidence)
Purpose: to link Wikidata triples with categorized alternative surface forms for testing surface-conditioned memorization.
Newly constructed resource for the study; no independent external validation mentioned.

