
arxiv: 2604.21882 · v1 · submitted 2026-04-23 · 💻 cs.CL

Revisiting Non-Verbatim Memorization in Large Language Models: The Role of Entity Surface Forms

Pith reviewed 2026-05-09 21:51 UTC · model grok-4.3

classification 💻 cs.CL
keywords non-verbatim memorization · entity surface forms · large language models · factual recall · Wikipedia redirects · knowledge evaluation · RedirectQA dataset

The pith

Factual recall in large language models changes when the same entity is referred to by different names or spellings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether large language models store facts about entities in a way that is tied to specific names or more abstractly. It builds a dataset that pairs each fact with multiple surface forms drawn from Wikipedia redirects, covering aliases, abbreviations, spelling variants, and common misspellings. Experiments across thirteen models show that accuracy often drops or rises when only the entity name changes, and that small spelling differences are handled more reliably than large lexical shifts like abbreviations. Both the overall frequency of an entity and the frequency of its particular name correlate with performance, with entity frequency adding explanatory power beyond surface frequency alone. This indicates that memorization sits between being locked to one phrasing and being fully independent of phrasing, so single-name tests may not fully reveal what models actually know.

Core claim

The authors introduce RedirectQA, an entity-based QA dataset that associates Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, prediction outcomes frequently change when only the entity surface form is altered. This inconsistency is category-dependent, with models more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses indicate that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, the ar
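To make the consistency claim concrete, here is a minimal sketch of how outcome changes between canonical and redirect surface forms could be tallied per redirect category. The record fields (`category`, `canonical_correct`, `redirect_correct`), the category labels, and the toy records are assumptions made for illustration; they are not the paper's data, results, or code.

```python
from collections import defaultdict

def consistency_by_category(records):
    """Tally paired outcomes per redirect category.

    Each record is a dict with (assumed) keys:
      category          -- redirect category label
      canonical_correct -- bool, model was right when asked with the canonical name
      redirect_correct  -- bool, model was right when asked with the redirect name
    """
    counts = defaultdict(lambda: {
        "both_correct": 0, "both_wrong": 0,
        "canonical_only": 0, "redirect_only": 0,
    })
    for r in records:
        c, d = r["canonical_correct"], r["redirect_correct"]
        if c and d:
            key = "both_correct"
        elif c:
            key = "canonical_only"
        elif d:
            key = "redirect_only"
        else:
            key = "both_wrong"
        counts[r["category"]][key] += 1

    summary = {}
    for category, cnt in counts.items():
        n = sum(cnt.values())
        flips = cnt["canonical_only"] + cnt["redirect_only"]
        # flip_rate: share of pairs whose outcome changed with the surface form
        summary[category] = {**cnt, "n": n, "flip_rate": flips / n}
    return summary

# Invented toy records, for illustration only.
toy_records = [
    {"category": "spelling_variant", "canonical_correct": True,  "redirect_correct": True},
    {"category": "spelling_variant", "canonical_correct": True,  "redirect_correct": True},
    {"category": "spelling_variant", "canonical_correct": False, "redirect_correct": False},
    {"category": "spelling_variant", "canonical_correct": True,  "redirect_correct": False},
    {"category": "abbreviation",     "canonical_correct": True,  "redirect_correct": False},
    {"category": "abbreviation",     "canonical_correct": True,  "redirect_correct": False},
    {"category": "abbreviation",     "canonical_correct": False, "redirect_correct": True},
    {"category": "abbreviation",     "canonical_correct": True,  "redirect_correct": True},
]

summary = consistency_by_category(toy_records)
for category, stats in summary.items():
    print(category, stats)
```

Keeping `canonical_only` and `redirect_only` separate matters because the page notes that accuracy can rise as well as drop when the name changes; a single flip rate would hide the direction.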

What carries the argument

RedirectQA dataset, which uses Wikipedia redirect categories to pair factual triples with multiple surface forms of each entity and thereby isolate the effect of name variation on recall.
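The pairing described above can be sketched roughly as follows. This is an illustrative reconstruction under stated assumptions, not the authors' pipeline: the `Triple` type, the placeholder entity ID, the category labels, the template wording, and the toy entries are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject_id: str  # placeholder for a Wikidata entity ID
    relation: str
    answer: str

# Hypothetical surface forms per entity, each tagged with a redirect category.
SURFACE_FORMS = {
    "Q_EXAMPLE": [
        ("United Nations", "canonical"),
        ("UN", "abbreviation"),
        ("United Natoins", "misspelling"),
    ],
}

# Hypothetical relation-specific question templates.
TEMPLATES = {
    "headquarters": "Where is the headquarters of {subject}?",
}

def build_questions(triples, surface_forms, templates):
    """Return one question per (triple, surface form) pair.

    The answer is held fixed across surface forms, so any change in model
    output within an entity comes from the name used in the question.
    """
    rows = []
    for t in triples:
        template = templates.get(t.relation)
        if template is None:
            continue  # no template for this relation: skip the triple
        for surface, category in surface_forms.get(t.subject_id, []):
            rows.append({
                "subject_id": t.subject_id,
                "surface": surface,
                "category": category,
                "question": template.format(subject=surface),
                "answer": t.answer,
            })
    return rows

triples = [Triple("Q_EXAMPLE", "headquarters", "New York City")]
questions = build_questions(triples, SURFACE_FORMS, TEMPLATES)
for q in questions:
    print(q["category"], "|", q["question"])
```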

If this is right

  • Entity QA evaluations that rely on a single canonical name per entity may give an incomplete picture of factual memorization.
  • Models exhibit greater consistency across minor spelling variants than across aliases or abbreviations.
  • Both the popularity of an entity and the popularity of its specific surface form influence recall accuracy.
  • Entity frequency often accounts for additional variance in accuracy after surface frequency is controlled for.
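One way to read "additional variance after surface frequency is controlled for" is as a partial correlation between accuracy and entity frequency, with surface frequency held fixed. The sketch below uses that reading as an assumption; the page does not state which statistical procedure the paper uses, and the variable names and toy numbers are invented.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after linearly removing zs from both.

    Undefined (division by zero) if zs is perfectly correlated with xs or ys.
    """
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Invented toy values: log frequencies and mean accuracy per bucket.
log_entity_freq = [2.0, 3.0, 4.0, 5.0, 6.0]
log_surface_freq = [1.0, 2.5, 2.0, 4.0, 3.5]
accuracy = [0.20, 0.35, 0.50, 0.60, 0.80]

r_partial = partial_correlation(log_entity_freq, accuracy, log_surface_freq)
print("entity frequency vs accuracy, controlling for surface frequency:", r_partial)
```

A partial correlation near zero would suggest that entity frequency adds little once surface frequency is known; a clearly positive value is consistent with the bullet above.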

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Knowledge benchmarks should routinely include multiple surface forms per entity to measure non-verbatim memorization more reliably.
  • Training procedures that expose models to varied names for the same entities may reduce surface-form sensitivity in factual recall.
  • The same surface-form sensitivity could appear in other memorized content, such as events or relations, if tested with comparable variation.

Load-bearing premise

Wikipedia redirect categories can separate surface-form effects from other factors such as entity frequency or question difficulty without selection bias.

What would settle it

A controlled experiment that balances all surface forms for frequency and query difficulty and then checks whether the observed inconsistency across forms disappears.
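A partial step toward such a control is to compare categories only within strata of similar entity frequency. The following is a minimal sketch under stated assumptions: the field names (`entity_freq`, `category`, `correct`), the log10 binning, and the toy records are hypothetical, and it balances only entity frequency, not query difficulty.

```python
import math
from collections import defaultdict

def stratified_accuracy(records, bin_width=1.0):
    """Accuracy per (log10 entity-frequency bin, category) cell.

    Assumes every record has entity_freq > 0.
    """
    cells = defaultdict(lambda: [0, 0])  # cell -> [correct, total]
    for r in records:
        bin_index = math.floor(math.log10(r["entity_freq"]) / bin_width)
        cell = cells[(bin_index, r["category"])]
        cell[0] += int(r["correct"])
        cell[1] += 1
    return {key: correct / total for key, (correct, total) in cells.items()}

# Invented toy records, for illustration only.
toy = [
    {"entity_freq": 150,   "category": "abbreviation",     "correct": True},
    {"entity_freq": 900,   "category": "abbreviation",     "correct": False},
    {"entity_freq": 12000, "category": "abbreviation",     "correct": True},
    {"entity_freq": 50000, "category": "spelling_variant", "correct": True},
    {"entity_freq": 300,   "category": "spelling_variant", "correct": False},
]

cells = stratified_accuracy(toy)
for (bin_index, category), acc in sorted(cells.items()):
    print(f"10^{bin_index} bin, {category}: accuracy {acc:.2f}")
```

If the gap between categories shrinks or vanishes inside each bin, the category effect was at least partly a frequency effect.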

Figures

Figures reproduced from arXiv: 2604.21882 by Hidetaka Kamigaito, Makoto Morishita, Naoki Shikoda, Ryo Fujii, Taro Watanabe, Yosuke Kishinami, Yuto Nishida.

Figure 1: Overview of the RedirectQA construction process: (1) Factual triples are collected from Wikidata. (2) Each subject entity is associated with canonical and redirect surface forms, together with redirect categories, using Wikipedia redirects. (3) Question realizations are generated from surface instances using relation-specific question templates.
Figure 2: Prediction consistency between canonical and redirect surface forms on RedirectQA using the original …
Figure 3: Relationship between accuracy and entity/surface frequencies for Pythia-12B. Each point shows the mean …
Figure 4: Prediction consistency between canonical and redirect surface forms on RedirectQA using the paraphrased …
Original abstract

Understanding what kinds of factual knowledge large language models (LLMs) memorize is essential for evaluating their reliability and limitations. Entity-based QA is a common framework for analyzing non-verbatim memorization, but typical evaluations query each entity using a single canonical surface form, making it difficult to disentangle fact memorization from access through a particular name. We introduce RedirectQA, an entity-based QA dataset that uses Wikipedia redirect information to associate Wikidata factual triples with categorized surface forms for each entity, including alternative names, abbreviations, spelling variants, and common erroneous forms. Across 13 LLMs, we examine surface-conditioned factual memorization and find that prediction outcomes often change when only the entity surface form changes. This inconsistency is category-dependent: models are more robust to minor orthographic variations than to larger lexical variations such as aliases and abbreviations. Frequency analyses further suggest that both entity- and surface-level frequencies are associated with accuracy, and that entity frequency often contributes beyond surface frequency. Overall, factual memorization appears neither purely surface-specific nor fully surface-invariant, highlighting the importance of surface-form diversity in evaluating non-verbatim memorization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces RedirectQA, a dataset constructed from Wikipedia redirects that associates Wikidata factual triples with categorized surface forms (aliases, abbreviations, spelling variants, erroneous forms) for each entity. It evaluates factual QA performance across 13 LLMs and reports that prediction outcomes frequently change when only the entity surface form is varied, with greater robustness to minor orthographic changes than to larger lexical variations. Frequency analyses indicate correlations with both entity-level and surface-level frequencies, where entity frequency often contributes predictive power beyond surface frequency alone. The authors conclude that factual memorization is neither purely surface-specific nor fully surface-invariant, emphasizing the value of surface-form diversity in non-verbatim memorization evaluations.

Significance. If the central empirical patterns hold after addressing potential dataset confounds, the work would meaningfully advance LLM evaluation practices by showing that single-surface-form benchmarks can misestimate factual recall. The RedirectQA construction provides a scalable, real-world-derived resource for testing surface sensitivity, and the frequency results offer a starting point for disentangling memorization from access mechanisms. This could influence benchmark design in knowledge probing and reliability assessment, though its impact depends on the robustness of the category isolation.

major comments (2)
  1. [§3] §3 (RedirectQA Dataset Construction): The use of Wikipedia redirect categories to isolate surface-form effects risks selection bias, as redirect volume and type correlate with entity popularity (more frequent entities have richer redirect graphs) and 'erroneous' forms may cluster on harder-to-recall entities. Without explicit stratification, propensity matching, or multivariate controls for entity frequency and query difficulty (e.g., template complexity or answer rarity) across categories, the reported category-dependent robustness may reflect these imbalances rather than genuine surface-form sensitivity. The frequency analyses mentioned do not appear to include such controls.
  2. [§5] §5 (Experiments): The claim that prediction outcomes 'often change' with surface form and that models are 'more robust' to minor variations requires per-category accuracy tables with statistical tests (e.g., paired significance or effect sizes) and full model list with hyperparameters. The abstract reports patterns across 13 models, but without data splits, query phrasing controls, or ablation on template effects, it is difficult to verify that changes are attributable to surface forms rather than confounds.
minor comments (3)
  1. [Abstract] The abstract and early sections should explicitly list the 13 LLMs evaluated (with sizes and sources) rather than deferring to a later table, to allow immediate assessment of scope.
  2. [§2] Related work section could more directly cite prior studies on alias handling in entity linking and surface-form robustness in QA to better position the novelty of RedirectQA.
  3. [§5] Ensure all figures include error bars, sample sizes per category, and clear legends; some frequency correlation plots appear to lack these based on the described results.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important considerations for strengthening our analysis of surface-form sensitivity in LLMs. We respond to each major comment below and commit to revisions that address the raised concerns.

point-by-point responses
  1. Referee: [§3] §3 (RedirectQA Dataset Construction): The use of Wikipedia redirect categories to isolate surface-form effects risks selection bias, as redirect volume and type correlate with entity popularity (more frequent entities have richer redirect graphs) and 'erroneous' forms may cluster on harder-to-recall entities. Without explicit stratification, propensity matching, or multivariate controls for entity frequency and query difficulty (e.g., template complexity or answer rarity) across categories, the reported category-dependent robustness may reflect these imbalances rather than genuine surface-form sensitivity. The frequency analyses mentioned do not appear to include such controls.

    Authors: We agree that selection bias is a valid concern given the nature of Wikipedia redirects. Our current frequency analyses show associations with both entity- and surface-level frequencies, with entity frequency often providing additional predictive power. However, these do not fully control for confounds such as query difficulty or stratification by popularity. In the revised version, we will add explicit stratification by entity frequency bins and include multivariate regression controls that account for entity popularity, answer rarity, and template complexity to better isolate surface-form effects. revision: yes

  2. Referee: [§5] §5 (Experiments): The claim that prediction outcomes 'often change' with surface form and that models are 'more robust' to minor variations requires per-category accuracy tables with statistical tests (e.g., paired significance or effect sizes) and full model list with hyperparameters. The abstract reports patterns across 13 models, but without data splits, query phrasing controls, or ablation on template effects, it is difficult to verify that changes are attributable to surface forms rather than confounds.

    Authors: We acknowledge the need for greater statistical detail and transparency. The manuscript presents aggregated results across 13 LLMs demonstrating category-dependent changes in predictions. To address this, we will include in the revised §5: (1) per-category accuracy tables with paired t-tests or McNemar's tests for significance and effect sizes (e.g., Cohen's d); (2) the complete list of models and their hyperparameters; (3) details on data splits and query phrasing controls; and (4) ablations examining template effects to confirm attribution to surface forms. These additions will allow verification of the robustness patterns. revision: yes
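For the paired significance test mentioned in the second response, an exact McNemar test needs only the two discordant counts. The sketch below is a standard-library illustration, not the authors' analysis code; the argument names `b` and `c` are assumptions (items answered correctly only with the canonical form, and only with the redirect form).

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b -- pairs correct with the canonical form only
    c -- pairs correct with the redirect form only
    Under the null hypothesis, each discordant pair falls on either side
    with probability 0.5, so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical discordant counts for one model and one redirect category.
print(mcnemar_exact(1, 9))
```

Concordant pairs (both correct or both wrong) do not enter the test, which is why it suits canonical-versus-redirect comparisons on the same facts.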

Circularity Check

0 steps flagged

No circularity: empirical dataset construction and standard QA evaluation

full rationale

The paper constructs RedirectQA by associating Wikidata triples with Wikipedia redirect surface forms and runs off-the-shelf LLM QA evaluations across 13 models. No equations, fitted parameters, or derivations appear in the provided text; claims rest on direct empirical measurements of accuracy changes across surface-form categories. Frequency analyses are post-hoc correlations, not inputs that force the main results. The study is self-contained against external benchmarks (model outputs on the new dataset) with no self-citation load-bearing or self-definitional steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The work rests on the assumption that redirect-based surface forms provide an unbiased way to isolate memorization effects and that standard entity QA probes factual knowledge without other confounds.

invented entities (1)
  • RedirectQA dataset · no independent evidence
    purpose: To link Wikidata triples with categorized alternative surface forms for testing surface-conditioned memorization
    Newly constructed resource for the study; no independent external validation mentioned


