Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms
Pith reviewed 2026-05-09 21:51 UTC · model grok-4.3
The pith
Factual recall in large language models changes when the same entity is referred to by different names or spellings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce RedirectQA, an entity-based QA dataset that associates Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, prediction outcomes frequently change when only the entity surface form is altered. This inconsistency is category-dependent, with models more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses indicate that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, the ar
What carries the argument
RedirectQA dataset, which uses Wikipedia redirect categories to pair factual triples with multiple surface forms of each entity and thereby isolate the effect of name variation on recall.
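As a rough illustration of how such a pairing isolates name variation, the sketch below groups queries by (entity, relation) and measures how often correctness under a non-canonical surface form differs from correctness under the canonical one. The record layout, the category labels, the name `flip_rate_by_category`, and the toy records are all hypothetical. They are not taken from the paper or from the RedirectQA release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurfaceQuery:
    """One QA query: a fact about an entity, asked under one surface form."""
    entity_id: str   # hypothetical identifier for the entity
    relation: str    # relation of the factual triple being queried
    surface: str     # the name used for the entity in the question
    category: str    # assumed labels: "canonical", "alias", "abbreviation", ...
    correct: bool    # whether the model's answer matched the gold answer


def flip_rate_by_category(queries):
    """Share of non-canonical queries whose correctness differs from the
    canonical-form query for the same (entity, relation) pair."""
    canonical = {
        (q.entity_id, q.relation): q.correct
        for q in queries
        if q.category == "canonical"
    }
    flips, totals = {}, {}
    for q in queries:
        key = (q.entity_id, q.relation)
        if q.category == "canonical" or key not in canonical:
            continue  # nothing to compare against
        totals[q.category] = totals.get(q.category, 0) + 1
        flips[q.category] = flips.get(q.category, 0) + (q.correct != canonical[key])
    return {cat: flips[cat] / totals[cat] for cat in totals}


# Toy records, invented purely to show the call shape.
toy = [
    SurfaceQuery("E1", "capital", "United Kingdom", "canonical", True),
    SurfaceQuery("E1", "capital", "UK", "abbreviation", False),
    SurfaceQuery("E1", "capital", "United Kingdon", "spelling", True),
]
rates = flip_rate_by_category(toy)
```

A flip counts in either direction (correct to wrong or wrong to correct), so the rate measures inconsistency rather than accuracy loss.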
If this is right
- Entity QA evaluations that rely on a single canonical name per entity may give an incomplete picture of factual memorization.
- Models exhibit greater consistency across minor spelling variants than across aliases or abbreviations.
- Both the popularity of an entity and the popularity of its specific surface form influence recall accuracy.
- Entity frequency often accounts for additional variance in accuracy after surface frequency is controlled for.
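One simple way to probe the last point is a partial correlation: the association between entity frequency and accuracy after the linear effect of surface frequency is removed from both. This summary does not say which statistic the authors used, so the sketch below is an assumption offered for illustration, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z removed from both.

    Intended reading (an assumption): x = log entity frequency,
    y = accuracy, z = log surface frequency.
    """
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

A partial correlation that stays clearly above zero would be consistent with entity frequency contributing beyond surface frequency. It is undefined when `z` is perfectly correlated with `x` or `y`, and it only removes linear dependence.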
Where Pith is reading between the lines
- Knowledge benchmarks should routinely include multiple surface forms per entity to measure non-verbatim memorization more reliably.
- Training procedures that expose models to varied names for the same entities may reduce surface-form sensitivity in factual recall.
- The same surface-form sensitivity could appear in other memorized content, such as events or relations, if tested with comparable variation.
Load-bearing premise
Wikipedia redirect categories can separate surface-form effects from other factors such as entity frequency or question difficulty without selection bias.
What would settle it
A controlled experiment that balances all surface forms for frequency and query difficulty and then checks whether the observed inconsistency across forms disappears.
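A minimal sketch of the frequency half of such a check follows: items are bucketed by log10 surface frequency, and two categories are compared only inside buckets where both occur. The tuple layout, the bin width, and the name `stratified_gap` are assumptions. The sketch does not balance query difficulty, which would need its own covariate.

```python
import math
from collections import defaultdict


def stratified_gap(items, cat_a, cat_b, bin_width=1.0):
    """Frequency-matched accuracy gap (cat_a minus cat_b).

    items: iterable of (category, surface_frequency, correct) tuples.
    Returns None when no frequency bin contains both categories.
    """
    bins = defaultdict(lambda: {cat_a: [], cat_b: []})
    for category, freq, correct in items:
        if category in (cat_a, cat_b):
            # +1 keeps zero-frequency surface forms in the lowest bin
            idx = math.floor(math.log10(freq + 1) / bin_width)
            bins[idx][category].append(1.0 if correct else 0.0)

    weighted_sum, total_weight = 0.0, 0
    for cell in bins.values():
        a_hits, b_hits = cell[cat_a], cell[cat_b]
        if a_hits and b_hits:
            # weight each bin by the size of its smaller group
            w = min(len(a_hits), len(b_hits))
            gap = sum(a_hits) / len(a_hits) - sum(b_hits) / len(b_hits)
            weighted_sum += w * gap
            total_weight += w
    return weighted_sum / total_weight if total_weight else None
```

If the gap between, say, spelling variants and aliases shrinks toward zero under this matching, the category effect would look more like a frequency artifact. If it persists, that would support genuine surface-form sensitivity.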
Original abstract
Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces RedirectQA, a dataset constructed from Wikipedia redirects that associates Wikidata factual triples with categorized surface forms (aliases, abbreviations, spelling variants, erroneous forms) for each entity. It evaluates factual QA performance across 13 LLMs and reports that prediction outcomes frequently change when only the entity surface form is varied, with greater robustness to minor orthographic changes than to larger lexical variations. Frequency analyses indicate correlations with both entity-level and surface-level frequencies, where entity frequency often contributes predictive power beyond surface frequency alone. The authors conclude that factual memorization is neither purely surface-specific nor fully surface-invariant, emphasizing the value of surface-form diversity in non-verbatim memorization evaluations.
Significance. If the central empirical patterns hold after addressing potential dataset confounds, the work would meaningfully advance LLM evaluation practices by showing that single-surface-form benchmarks can misestimate factual recall. The RedirectQA construction provides a scalable, real-world-derived resource for testing surface sensitivity, and the frequency results offer a starting point for disentangling memorization from access mechanisms. This could influence benchmark design in knowledge probing and reliability assessment, though its impact depends on the robustness of the category isolation.
major comments (2)
- [§3] §3 (RedirectQA Dataset Construction): The use of Wikipedia redirect categories to isolate surface-form effects risks selection bias, as redirect volume and type correlate with entity popularity (more frequent entities have richer redirect graphs) and 'erroneous' forms may cluster on harder-to-recall entities. Without explicit stratification, propensity matching, or multivariate controls for entity frequency and query difficulty (e.g., template complexity or answer rarity) across categories, the reported category-dependent robustness may reflect these imbalances rather than genuine surface-form sensitivity. The frequency analyses mentioned do not appear to include such controls.
- [§5] §5 (Experiments): The claim that prediction outcomes 'often change' with surface form and that models are 'more robust' to minor variations requires per-category accuracy tables with statistical tests (e.g., paired significance or effect sizes) and full model list with hyperparameters. The abstract reports patterns across 13 models, but without data splits, query phrasing controls, or ablation on template effects, it is difficult to verify that changes are attributable to surface forms rather than confounds.
minor comments (3)
- [Abstract] The abstract and early sections should explicitly list the 13 LLMs evaluated (with sizes and sources) rather than deferring to a later table, to allow immediate assessment of scope.
- [§2] The related work section could more directly cite prior studies on alias handling in entity linking and surface-form robustness in QA to better position the novelty of RedirectQA.
- [§5] Ensure all figures include error bars, sample sizes per category, and clear legends; some frequency correlation plots appear to lack these based on the described results.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important considerations for strengthening our analysis of surface-form sensitivity in LLMs. We respond to each major comment below and commit to revisions that address the raised concerns.
- Referee: [§3] §3 (RedirectQA Dataset Construction): The use of Wikipedia redirect categories to isolate surface-form effects risks selection bias, as redirect volume and type correlate with entity popularity (more frequent entities have richer redirect graphs) and 'erroneous' forms may cluster on harder-to-recall entities. Without explicit stratification, propensity matching, or multivariate controls for entity frequency and query difficulty (e.g., template complexity or answer rarity) across categories, the reported category-dependent robustness may reflect these imbalances rather than genuine surface-form sensitivity. The frequency analyses mentioned do not appear to include such controls.
Authors: We agree that selection bias is a valid concern given the nature of Wikipedia redirects. Our current frequency analyses show associations with both entity- and surface-level frequencies, with entity frequency often providing additional predictive power. However, these do not fully control for confounds such as query difficulty or stratification by popularity. In the revised version, we will add explicit stratification by entity frequency bins and include multivariate regression controls that account for entity popularity, answer rarity, and template complexity to better isolate surface-form effects. revision: yes
- Referee: [§5] §5 (Experiments): The claim that prediction outcomes 'often change' with surface form and that models are 'more robust' to minor variations requires per-category accuracy tables with statistical tests (e.g., paired significance or effect sizes) and full model list with hyperparameters. The abstract reports patterns across 13 models, but without data splits, query phrasing controls, or ablation on template effects, it is difficult to verify that changes are attributable to surface forms rather than confounds.
Authors: We acknowledge the need for greater statistical detail and transparency. The manuscript presents aggregated results across 13 LLMs demonstrating category-dependent changes in predictions. To address this, we will include in the revised §5: (1) per-category accuracy tables with paired t-tests or McNemar's tests for significance and effect sizes (e.g., Cohen's d); (2) the complete list of models and their hyperparameters; (3) details on data splits and query phrasing controls; and (4) ablations examining template effects to confirm attribution to surface forms. These additions will allow verification of the robustness patterns. revision: yes
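For readers unfamiliar with the paired test named in the response, here is a minimal sketch of an exact two-sided McNemar test on per-question correctness under two surface forms of the same entity. It is a standard-library illustration under assumed inputs (two aligned lists of booleans), not the authors' analysis code. The function name is hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired binary outcomes.

    correct_a[i] and correct_b[i] are the model's correctness on the same
    fact queried under surface form A and surface form B.
    Returns (b, c, p_value), where b counts pairs correct only under A
    and c counts pairs correct only under B.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Under H0 each discordant pair favours A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

Only discordant pairs carry information here, which is why the test suits the question of whether predictions change when only the surface form changes. Concordant pairs, whether both right or both wrong, drop out.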
Circularity Check
No circularity: empirical dataset construction and standard QA evaluation
The paper constructs RedirectQA by associating Wikidata triples with Wikipedia redirect surface forms and runs off-the-shelf LLM QA evaluations across 13 models. No equations, fitted parameters, or derivations appear in the provided text; claims rest on direct empirical measurements of accuracy changes across surface-form categories. Frequency analyses are post-hoc correlations, not inputs that force the main results. The study is self-contained, resting on external measurements (model outputs on the new dataset), with no load-bearing self-citations or self-definitional steps.
Axiom & Free-Parameter Ledger
invented entities (1)
- RedirectQA dataset: no independent evidence