Remembering Unequally: Global and Disciplinary Bias in LLM Reconstruction of Scholarly Coauthor Lists

Afra Mashhadi; Ghazal Kalhor

arXiv: 2511.00476 · v2 · submitted 2025-11-01 · cs.CL

Pith reviewed 2026-05-18 01:44 UTC · model grok-4.3

keywords LLM memorization · coauthor reconstruction · scholarly bias · academic networks · disciplinary differences · regional inequality · AI fairness · research discovery

The pith

Large language models reconstruct coauthor lists with a clear bias toward highly cited researchers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are now used to answer questions about who works with whom in science. This paper tests whether three leading models can accurately recall coauthor lists or whether they simply reproduce patterns from their training data. The results show that the models consistently over-represent already prominent scholars when measured against real publication records. The imbalance is not uniform: clinical medicine and some regions in Africa produce more even lists. If the pattern holds, reliance on these models for academic search will tend to widen the gap between visible and less-visible researchers.

Core claim

When prompted to list coauthors, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B produce lists that systematically favor highly cited researchers relative to bibliographic reference data. The same models generate more balanced lists for certain disciplines such as Clinical Medicine and for researchers in parts of Africa.

What carries the argument

Direct comparison of LLM-generated coauthor lists against bibliographic reference data to measure memorization bias.
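
The comparison step can be pictured with a small sketch. The following Python snippet scores one generated coauthor list against a reference list using precision, recall, and Jaccard overlap. The choice of metrics, the `normalize` helper, and the placeholder names are assumptions for illustration only; this summary does not state the paper's exact matching procedure.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and collapse whitespace (hypothetical matching rule)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_accents.lower().split())


def overlap_scores(generated: list[str], reference: list[str]) -> dict[str, float]:
    """Compare a generated coauthor list with a reference list as sets of normalized names."""
    gen = {normalize(n) for n in generated}
    ref = {normalize(n) for n in reference}
    hits = gen & ref
    union = gen | ref
    return {
        "precision": len(hits) / len(gen) if gen else 0.0,
        "recall": len(hits) / len(ref) if ref else 0.0,
        "jaccard": len(hits) / len(union) if union else 0.0,
    }


# Placeholder names, not real researchers.
scores = overlap_scores(
    generated=["Ada Example", "José  Placeholder", "Sam Invented"],
    reference=["ada example", "Jose Placeholder", "Kim Sample", "Lee Dummy"],
)
```

Here `scores` comes out as precision 2/3, recall 0.5, and Jaccard 0.4: two of the three generated names match after normalization, while two reference coauthors are missed.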

If this is right

Scholarly search interfaces that rely on LLMs will tend to surface already prominent names more often than others.
Queries about researchers in clinical medicine or certain African regions are likely to return more representative results than queries in other fields or regions.
Developers of academic tools will need to audit memorization effects before deploying LLM-based relationship features.
Uneven recall could slow the discovery of collaborative work involving less-cited scholars.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Interfaces that surface LLM-generated coauthor suggestions could be supplemented with explicit links to public citation databases to offset the bias.
Training data that under-samples certain regions or disciplines will continue to shape what LLMs treat as normal academic networks.
Similar tests could be run on other relational tasks, such as suggesting collaborators or recalling citation chains, to check whether the same visibility bias appears.

Load-bearing premise

Bibliographic reference data serves as a complete and unbiased record of actual coauthor relationships.

What would settle it

Collecting self-reported coauthor lists from a diverse sample of researchers and comparing them directly to the same LLM outputs would show whether the observed bias survives outside the reference dataset.

original abstract

Ongoing breakthroughs in large language models (LLMs) are reshaping scholarly search and discovery interfaces. While these systems offer new possibilities for navigating scientific knowledge, they also raise concerns about fairness and representational bias rooted in the models' memorized training data. As LLMs are increasingly used to answer queries about researchers and research communities, their ability to accurately reconstruct scholarly coauthor lists becomes an important but underexamined issue. In this study, we investigate how memorization in LLMs affects the reconstruction of coauthor lists and whether this process reflects existing inequalities across academic disciplines and world regions. We evaluate three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B, by comparing their generated coauthor lists against bibliographic reference data. Our analysis reveals a systematic advantage for highly cited researchers, indicating that LLM memorization disproportionately favors already visible scholars. However, this pattern is not uniform: certain disciplines, such as Clinical Medicine, and some regions, including parts of Africa, exhibit more balanced reconstruction outcomes. These findings highlight both the risks and limitations of relying on LLM-generated relational knowledge in scholarly discovery contexts and emphasize the need for careful auditing of memorization-driven biases in LLM-based systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates how memorization in LLMs affects reconstruction of scholarly coauthor lists, evaluating DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B against bibliographic reference data. It reports a systematic advantage for highly cited researchers, indicating disproportionate favoritism toward visible scholars, while noting more balanced outcomes in disciplines such as Clinical Medicine and regions including parts of Africa.

Significance. If the attribution to LLM memorization holds after addressing potential confounds in the reference data, the findings would provide useful empirical evidence on fairness risks when LLMs are used for scholarly discovery and relational queries. The work contributes to auditing memorization-driven biases in AI systems applied to academic contexts.

major comments (2)

Abstract: The central claim that LLM reconstruction exhibits a systematic advantage for highly cited researchers requires that differences in fidelity versus bibliographic lists can be attributed to memorization. This holds only if the reference data (e.g., Scopus or Web of Science) exhibits comparable completeness across citation strata and regions; lower coverage for low-citation or African-institution authors would produce the observed pattern even without differential LLM bias.
Abstract: The reported more balanced reconstruction outcomes in Clinical Medicine and parts of Africa could reflect domain- or region-specific differences in bibliographic coverage rather than reduced LLM bias. Without explicit coverage audits, matching criteria, or statistical controls for reference completeness, the non-uniformity cannot be isolated from ground-truth artifacts.
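
A small worked example may make the confound concrete. In the sketch below, a hypothetical model returns the same six correct coauthors for two authors, but the reference record is complete for one and partial for the other. All names and set sizes are invented for illustration, and Jaccard overlap is an assumed metric.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap between two sets of names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Ten true coauthors (placeholder identifiers c0..c9).
true_coauthors = {f"c{i}" for i in range(10)}

# The model recalls the same six genuine coauthors in both scenarios.
model_output = {f"c{i}" for i in range(6)}

# Scenario A: the bibliographic record is complete.
full_reference = set(true_coauthors)

# Scenario B: the record covers only five coauthors,
# three of which the model also returned.
partial_reference = {"c0", "c1", "c2", "c6", "c7"}

score_full = jaccard(model_output, full_reference)        # 6 shared / 10 in union
score_partial = jaccard(model_output, partial_reference)  # 3 shared / 8 in union
```

The model's behavior is identical in both cases, yet the score falls from 0.6 to 0.375 purely because three correct names are missing from the partial reference and are counted as errors.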

minor comments (1)

Abstract: The description of the three models and comparison procedure would benefit from a brief statement of sample size, query formulation, and exact matching method to allow readers to assess reproducibility from the outset.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and for highlighting important potential confounds related to bibliographic database coverage. We address each major comment below and have revised the manuscript to incorporate additional analyses, controls, and qualifications that strengthen the link between observed patterns and LLM memorization while acknowledging data limitations.

point-by-point responses

Referee: Abstract: The central claim that LLM reconstruction exhibits a systematic advantage for highly cited researchers requires that differences in fidelity versus bibliographic lists can be attributed to memorization. This holds only if the reference data (e.g., Scopus or Web of Science) exhibits comparable completeness across citation strata and regions; lower coverage for low-citation or African-institution authors would produce the observed pattern even without differential LLM bias.

Authors: We agree that differential completeness in the reference data across citation strata and regions is a plausible alternative explanation that must be ruled out or controlled for before attributing discrepancies primarily to memorization. In the revised manuscript we have added a dedicated subsection to the Methods that reports coverage statistics (proportion of authors with verifiable coauthor lists) broken down by citation quartile and by region. We further include regression models that control for the number of indexed publications per author as a proxy for database visibility. These controls show that the systematic advantage for highly cited researchers persists at statistically significant levels. We have also revised the abstract to describe the advantage as 'consistent with memorization effects after accounting for reference-data coverage.'
revision: yes
Referee: Abstract: The reported more balanced reconstruction outcomes in Clinical Medicine and parts of Africa could reflect domain- or region-specific differences in bibliographic coverage rather than reduced LLM bias. Without explicit coverage audits, matching criteria, or statistical controls for reference completeness, the non-uniformity cannot be isolated from ground-truth artifacts.

Authors: We acknowledge that domain- and region-specific coverage differences could produce the appearance of more balanced outcomes without any change in LLM behavior. The revised version now contains explicit coverage audits that compare the fraction of authors with complete versus partial coauthor information in the reference data for Clinical Medicine versus other disciplines and for African institutions versus other regions. We additionally report author-matching criteria (disambiguation thresholds) and include interaction terms between discipline/region and coverage metrics in our statistical models. These checks indicate that the relative balance in the noted fields and regions is not fully explained by coverage artifacts. The abstract and results sections have been updated to summarize these controls.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison to external bibliographic data

full rationale

The paper performs an empirical evaluation by prompting LLMs to reconstruct coauthor lists and directly comparing outputs against independent bibliographic reference data. The central claim of systematic advantage for highly cited researchers rests on observed differences in reconstruction fidelity across citation strata and regions. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided derivation; the reference data serves as an external benchmark rather than an input that is redefined or predicted from the model outputs themselves. The analysis is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that bibliographic records constitute an objective external benchmark and that differences between LLM output and those records can be attributed to memorization rather than other factors such as query formulation or model decoding choices.

axioms (1)

domain assumption: Bibliographic databases provide a complete and unbiased record of actual coauthorships.
Invoked when the paper treats reference data as ground truth for measuring reconstruction accuracy.
