Remembering Unequally: Global and Disciplinary Bias in LLM Reconstruction of Scholarly Coauthor Lists
Pith reviewed 2026-05-18 01:44 UTC · model grok-4.3
The pith
Large language models reconstruct coauthor lists with a clear bias toward highly cited researchers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When prompted to list coauthors, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B produce lists that systematically favor highly cited researchers relative to bibliographic reference data. The same models generate more balanced lists for certain disciplines such as Clinical Medicine and for researchers in parts of Africa.
What carries the argument
Direct comparison of LLM-generated coauthor lists against bibliographic reference data to measure memorization bias.
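A minimal sketch of what such a comparison could look like, assuming a set-overlap score (Jaccard similarity) and a simple name normalization step. Neither the metric nor the matching rule is confirmed by the text above; `normalize_name` and `jaccard` are hypothetical helpers used only for illustration.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation so trivially different spellings match.

    Assumption: this matching rule is illustrative, not the paper's procedure.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(kept.lower().split())


def jaccard(generated: list[str], reference: list[str]) -> float:
    """Share of distinct names that appear in both lists (intersection over union)."""
    gen = {normalize_name(n) for n in generated}
    ref = {normalize_name(n) for n in reference}
    if not gen and not ref:
        return 1.0  # convention chosen here: two empty lists agree trivially
    return len(gen & ref) / len(gen | ref)


# Hypothetical example: one shared coauthor out of three distinct names.
score = jaccard(["Ana López", "B. Smith"], ["ana lopez", "C. Jones"])
```

Averaging such a score separately for highly cited and less-cited researchers would expose the kind of gap the core claim describes. A stricter or looser matching rule would change the absolute scores, which is why the matching method matters for reproducibility.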
If this is right
- Scholarly search interfaces that rely on LLMs will tend to surface already prominent names more often than others.
- Queries about researchers in clinical medicine or certain African regions are likely to return more representative results than queries in other fields or regions.
- Developers of academic tools will need to audit memorization effects before deploying LLM-based relationship features.
- Uneven recall could slow the discovery of collaborative work involving less-cited scholars.
Where Pith is reading between the lines
- Interfaces that surface LLM-generated coauthor suggestions could be supplemented with explicit links to public citation databases to offset the bias.
- Training data that under-samples certain regions or disciplines will continue to shape what LLMs treat as normal academic networks.
- Similar tests could be run on other relational tasks, such as suggesting collaborators or recalling citation chains, to check whether the same visibility bias appears.
Load-bearing premise
Bibliographic reference data serves as a complete and unbiased record of actual coauthor relationships.
What would settle it
Collecting self-reported coauthor lists from a diverse sample of researchers and comparing them directly to the same LLM outputs would show whether the observed bias survives outside the reference dataset.
Original abstract
Ongoing breakthroughs in large language models (LLMs) are reshaping scholarly search and discovery interfaces. While these systems offer new possibilities for navigating scientific knowledge, they also raise concerns about fairness and representational bias rooted in the models' memorized training data. As LLMs are increasingly used to answer queries about researchers and research communities, their ability to accurately reconstruct scholarly coauthor lists becomes an important but underexamined issue. In this study, we investigate how memorization in LLMs affects the reconstruction of coauthor lists and whether this process reflects existing inequalities across academic disciplines and world regions. We evaluate three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B, by comparing their generated coauthor lists against bibliographic reference data. Our analysis reveals a systematic advantage for highly cited researchers, indicating that LLM memorization disproportionately favors already visible scholars. However, this pattern is not uniform: certain disciplines, such as Clinical Medicine, and some regions, including parts of Africa, exhibit more balanced reconstruction outcomes. These findings highlight both the risks and limitations of relying on LLM-generated relational knowledge in scholarly discovery contexts and emphasize the need for careful auditing of memorization-driven biases in LLM-based systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates how memorization in LLMs affects reconstruction of scholarly coauthor lists, evaluating DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B against bibliographic reference data. It reports a systematic advantage for highly cited researchers, indicating disproportionate favoritism toward visible scholars, while noting more balanced outcomes in disciplines such as Clinical Medicine and regions including parts of Africa.
Significance. If the attribution to LLM memorization holds after addressing potential confounds in the reference data, the findings would provide useful empirical evidence on fairness risks when LLMs are used for scholarly discovery and relational queries. The work contributes to auditing memorization-driven biases in AI systems applied to academic contexts.
major comments (2)
- Abstract: The central claim that LLM reconstruction exhibits a systematic advantage for highly cited researchers requires that differences in fidelity versus bibliographic lists can be attributed to memorization. This holds only if the reference data (e.g., Scopus or Web of Science) exhibits comparable completeness across citation strata and regions; lower coverage for low-citation or African-institution authors would produce the observed pattern even without differential LLM bias.
- Abstract: The reported more balanced reconstruction outcomes in Clinical Medicine and parts of Africa could reflect domain- or region-specific differences in bibliographic coverage rather than reduced LLM bias. Without explicit coverage audits, matching criteria, or statistical controls for reference completeness, the non-uniformity cannot be isolated from ground-truth artifacts.
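The coverage audit both major comments call for can be sketched as a per-stratum completeness rate. This is an illustration under stated assumptions: the record layout, the stratum labels, and the `coverage_by_stratum` helper are hypothetical, and the toy records are made up rather than taken from the paper.

```python
from collections import defaultdict


def coverage_by_stratum(records):
    """Fraction of authors with a complete reference coauthor list, per stratum.

    records: iterable of (stratum_label, has_complete_reference_list) pairs.
    A stratum could be a citation quartile, a region, or a discipline.
    """
    totals = defaultdict(int)
    complete = defaultdict(int)
    for stratum, is_complete in records:
        totals[stratum] += 1
        if is_complete:
            complete[stratum] += 1
    return {stratum: complete[stratum] / totals[stratum] for stratum in totals}


# Toy, synthetic records: (citation quartile, reference list judged complete?)
toy_records = [("Q1", True), ("Q1", False), ("Q4", True), ("Q4", True)]
rates = coverage_by_stratum(toy_records)
```

If the rates differ sharply across strata, lower reconstruction scores in the poorly covered strata could stem from the reference data rather than from the models, which is the confound the referee describes.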
minor comments (1)
- Abstract: The description of the three models and comparison procedure would benefit from a brief statement of sample size, query formulation, and exact matching method to allow readers to assess reproducibility from the outset.
Simulated Author's Rebuttal
We thank the referee for their careful reading and for highlighting important potential confounds related to bibliographic database coverage. We address each major comment below and have revised the manuscript to incorporate additional analyses, controls, and qualifications that strengthen the link between observed patterns and LLM memorization while acknowledging data limitations.
- Referee: Abstract: The central claim that LLM reconstruction exhibits a systematic advantage for highly cited researchers requires that differences in fidelity versus bibliographic lists can be attributed to memorization. This holds only if the reference data (e.g., Scopus or Web of Science) exhibits comparable completeness across citation strata and regions; lower coverage for low-citation or African-institution authors would produce the observed pattern even without differential LLM bias.
  Authors: We agree that differential completeness in the reference data across citation strata and regions is a plausible alternative explanation that must be ruled out or controlled for before attributing discrepancies primarily to memorization. In the revised manuscript we have added a dedicated subsection to the Methods that reports coverage statistics (proportion of authors with verifiable coauthor lists) broken down by citation quartile and by region. We further include regression models that control for the number of indexed publications per author as a proxy for database visibility. These controls show that the systematic advantage for highly cited researchers persists at statistically significant levels. We have also revised the abstract to describe the advantage as 'consistent with memorization effects after accounting for reference-data coverage.'
  Revision: yes
- Referee: Abstract: The reported more balanced reconstruction outcomes in Clinical Medicine and parts of Africa could reflect domain- or region-specific differences in bibliographic coverage rather than reduced LLM bias. Without explicit coverage audits, matching criteria, or statistical controls for reference completeness, the non-uniformity cannot be isolated from ground-truth artifacts.
  Authors: We acknowledge that domain- and region-specific coverage differences could produce the appearance of more balanced outcomes without any change in LLM behavior. The revised version now contains explicit coverage audits that compare the fraction of authors with complete versus partial coauthor information in the reference data for Clinical Medicine versus other disciplines and for African institutions versus other regions. We additionally report author-matching criteria (disambiguation thresholds) and include interaction terms between discipline/region and coverage metrics in our statistical models. These checks indicate that the relative balance in the noted fields and regions is not fully explained by coverage artifacts. The abstract and results sections have been updated to summarize these controls.
  Revision: yes
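The kind of control described in the responses above can be illustrated with a small ordinary least squares fit. This is a sketch under assumptions, not the authors' model: the specification (reconstruction fidelity regressed on a highly-cited indicator plus indexed publication count), the `ols` helper, and the synthetic data are all hypothetical.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of rows (include a leading 1.0 for the intercept); y is a list
    of floats. Solved with Gauss-Jordan elimination and partial pivoting.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


# Synthetic data (made up): (highly_cited indicator, indexed publications).
authors = [(0, 10), (1, 10), (0, 20), (1, 30), (0, 5), (1, 40)]
rows = [[1.0, float(hc), float(pubs)] for hc, pubs in authors]
# Fidelity generated from known coefficients so the fit can be checked.
fidelity = [0.2 + 0.05 * hc + 0.01 * pubs for hc, pubs in authors]
intercept, highly_cited_effect, pubs_effect = ols(rows, fidelity)
```

In this toy setup the coefficient on the highly-cited indicator is the advantage that remains after holding indexed publications fixed. A real analysis would also need standard errors and the interaction terms mentioned in the second response, which this sketch omits.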
Circularity Check
No circularity: empirical comparison to external bibliographic data
The paper performs an empirical evaluation by prompting LLMs to reconstruct coauthor lists and directly comparing outputs against independent bibliographic reference data. The central claim of systematic advantage for highly cited researchers rests on observed differences in reconstruction fidelity across citation strata and regions. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided derivation; the reference data serves as an external benchmark rather than an input that is redefined or predicted from the model outputs themselves. The analysis is therefore self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Bibliographic databases provide a complete and unbiased record of actual coauthorships.