
arxiv: 2511.00476 · v2 · submitted 2025-11-01 · 💻 cs.CL

Remembering Unequally: Global and Disciplinary Bias in LLM Reconstruction of Scholarly Coauthor Lists

Pith reviewed 2026-05-18 01:44 UTC · model grok-4.3

keywords LLM memorization · coauthor reconstruction · scholarly bias · academic networks · disciplinary differences · regional inequality · AI fairness · research discovery

The pith

Large language models reconstruct coauthor lists with a clear bias toward highly cited researchers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are now used to answer questions about who works with whom in science. This paper tests whether three leading models can accurately recall coauthor lists or whether they simply reproduce patterns from their training data. The results show that the models consistently over-represent already prominent scholars when measured against real publication records. The imbalance is not uniform: clinical medicine and some regions in Africa produce more even lists. If the pattern holds, reliance on these models for academic search will tend to widen the gap between visible and less-visible researchers.

Core claim

When prompted to list coauthors, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B produce lists that systematically favor highly cited researchers relative to bibliographic reference data. The same models generate more balanced lists for certain disciplines such as Clinical Medicine and for researchers in parts of Africa.

What carries the argument

Direct comparison of LLM-generated coauthor lists against bibliographic reference data to measure memorization bias.
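
One way such a comparison could be scored is sketched below in Python. The exact matching method and metrics used in the paper are not stated in this summary, so the name normalization and the choice of precision, recall, and Jaccard overlap are assumptions made for illustration. `normalize_name` and `overlap_scores` are hypothetical helper names, not functions from the paper.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase a name, strip accents and punctuation, collapse whitespace.

    Assumption: this simple normalization stands in for whatever author
    disambiguation the study actually applied.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in no_accents)
    return " ".join(cleaned.lower().split())


def overlap_scores(generated, reference):
    """Compare an LLM-generated coauthor list with a reference list.

    Returns precision, recall, and Jaccard overlap over normalized name sets.
    """
    gen = {normalize_name(n) for n in generated if n.strip()}
    ref = {normalize_name(n) for n in reference if n.strip()}
    hits = gen & ref
    union = gen | ref
    return {
        "precision": len(hits) / len(gen) if gen else 0.0,
        "recall": len(hits) / len(ref) if ref else 0.0,
        "jaccard": len(hits) / len(union) if union else 0.0,
    }
```

Computing these scores per researcher and then grouping them by citation count, discipline, or region would expose the kind of uneven recall the paper describes, provided the reference lists themselves are trustworthy.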

If this is right

  • Scholarly search interfaces that rely on LLMs will tend to surface already prominent names more often than others.
  • Queries about researchers in clinical medicine or certain African regions are likely to return more representative results than queries in other fields or regions.
  • Developers of academic tools will need to audit memorization effects before deploying LLM-based relationship features.
  • Uneven recall could slow the discovery of collaborative work involving less-cited scholars.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Interfaces that surface LLM-generated coauthor suggestions could be supplemented with explicit links to public citation databases to offset the bias.
  • Training data that under-samples certain regions or disciplines will continue to shape what LLMs treat as normal academic networks.
  • Similar tests could be run on other relational tasks, such as suggesting collaborators or recalling citation chains, to check whether the same visibility bias appears.

Load-bearing premise

Bibliographic reference data serves as a complete and unbiased record of actual coauthor relationships.

What would settle it

Collecting self-reported coauthor lists from a diverse sample of researchers and comparing them directly to the same LLM outputs would show whether the observed bias survives outside the reference dataset.

Original abstract

Ongoing breakthroughs in large language models (LLMs) are reshaping scholarly search and discovery interfaces. While these systems offer new possibilities for navigating scientific knowledge, they also raise concerns about fairness and representational bias rooted in the models' memorized training data. As LLMs are increasingly used to answer queries about researchers and research communities, their ability to accurately reconstruct scholarly coauthor lists becomes an important but underexamined issue. In this study, we investigate how memorization in LLMs affects the reconstruction of coauthor lists and whether this process reflects existing inequalities across academic disciplines and world regions. We evaluate three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B, by comparing their generated coauthor lists against bibliographic reference data. Our analysis reveals a systematic advantage for highly cited researchers, indicating that LLM memorization disproportionately favors already visible scholars. However, this pattern is not uniform: certain disciplines, such as Clinical Medicine, and some regions, including parts of Africa, exhibit more balanced reconstruction outcomes. These findings highlight both the risks and limitations of relying on LLM-generated relational knowledge in scholarly discovery contexts and emphasize the need for careful auditing of memorization-driven biases in LLM-based systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper investigates how memorization in LLMs affects reconstruction of scholarly coauthor lists, evaluating DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B against bibliographic reference data. It reports a systematic advantage for highly cited researchers, indicating disproportionate favoritism toward visible scholars, while noting more balanced outcomes in disciplines such as Clinical Medicine and regions including parts of Africa.

Significance. If the attribution to LLM memorization holds after addressing potential confounds in the reference data, the findings would provide useful empirical evidence on fairness risks when LLMs are used for scholarly discovery and relational queries. The work contributes to auditing memorization-driven biases in AI systems applied to academic contexts.

major comments (2)
  1. Abstract: The central claim that LLM reconstruction exhibits a systematic advantage for highly cited researchers requires that differences in fidelity versus bibliographic lists can be attributed to memorization. This holds only if the reference data (e.g., Scopus or Web of Science) exhibits comparable completeness across citation strata and regions; lower coverage for low-citation or African-institution authors would produce the observed pattern even without differential LLM bias.
  2. Abstract: The reported more balanced reconstruction outcomes in Clinical Medicine and parts of Africa could reflect domain- or region-specific differences in bibliographic coverage rather than reduced LLM bias. Without explicit coverage audits, matching criteria, or statistical controls for reference completeness, the non-uniformity cannot be isolated from ground-truth artifacts.
minor comments (1)
  1. Abstract: The description of the three models and comparison procedure would benefit from a brief statement of sample size, query formulation, and exact matching method to allow readers to assess reproducibility from the outset.
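
The coverage concern raised in the major comments can be made concrete with a small audit. The Python sketch below is an assumption about how such an audit might look, not the paper's procedure: it takes hypothetical author records with a grouping field (for example a region or citation quartile) and a `reference_coauthors` list, and reports the share of authors in each group whose reference list is non-empty. The field names and the function name `coverage_by_group` are illustrative.

```python
from collections import Counter


def coverage_by_group(authors, group_key):
    """Share of authors per group with a non-empty reference coauthor list.

    `authors` is a list of dicts; `group_key` names the grouping field,
    e.g. "region" or "quartile" (hypothetical field names).
    """
    totals, covered = Counter(), Counter()
    for author in authors:
        group = author[group_key]
        totals[group] += 1
        if author["reference_coauthors"]:
            covered[group] += 1
    return {group: covered[group] / totals[group] for group in totals}
```

If coverage differs sharply between groups, lower reconstruction scores in a poorly covered group may reflect gaps in the reference data rather than model behavior.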

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and for highlighting important potential confounds related to bibliographic database coverage. We address each major comment below and have revised the manuscript to incorporate additional analyses, controls, and qualifications that strengthen the link between observed patterns and LLM memorization while acknowledging data limitations.

Point-by-point responses
  1. Referee: Abstract: The central claim that LLM reconstruction exhibits a systematic advantage for highly cited researchers requires that differences in fidelity versus bibliographic lists can be attributed to memorization. This holds only if the reference data (e.g., Scopus or Web of Science) exhibits comparable completeness across citation strata and regions; lower coverage for low-citation or African-institution authors would produce the observed pattern even without differential LLM bias.

    Authors: We agree that differential completeness in the reference data across citation strata and regions is a plausible alternative explanation that must be ruled out or controlled for before attributing discrepancies primarily to memorization. In the revised manuscript we have added a dedicated subsection to the Methods that reports coverage statistics (proportion of authors with verifiable coauthor lists) broken down by citation quartile and by region. We further include regression models that control for the number of indexed publications per author as a proxy for database visibility. These controls show that the systematic advantage for highly cited researchers persists at statistically significant levels. We have also revised the abstract to describe the advantage as 'consistent with memorization effects after accounting for reference-data coverage.'

    Revision: yes

  2. Referee: Abstract: The reported more balanced reconstruction outcomes in Clinical Medicine and parts of Africa could reflect domain- or region-specific differences in bibliographic coverage rather than reduced LLM bias. Without explicit coverage audits, matching criteria, or statistical controls for reference completeness, the non-uniformity cannot be isolated from ground-truth artifacts.

    Authors: We acknowledge that domain- and region-specific coverage differences could produce the appearance of more balanced outcomes without any change in LLM behavior. The revised version now contains explicit coverage audits that compare the fraction of authors with complete versus partial coauthor information in the reference data for Clinical Medicine versus other disciplines and for African institutions versus other regions. We additionally report author-matching criteria (disambiguation thresholds) and include interaction terms between discipline/region and coverage metrics in our statistical models. These checks indicate that the relative balance in the noted fields and regions is not fully explained by coverage artifacts. The abstract and results sections have been updated to summarize these controls.

    Revision: yes
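
The idea of checking whether the citation advantage survives a coverage control can be illustrated with a stratified comparison. The Python sketch below is a deliberately simpler stand-in for the regression models and interaction terms mentioned in the simulated rebuttal, not a reproduction of them. It assumes hypothetical records with a `coverage` stratum, a citation `quartile` (1 = least cited, 4 = most cited), and a reconstruction `score`; all field and function names are illustrative.

```python
from collections import defaultdict
from statistics import mean


def stratified_means(records):
    """Mean reconstruction score per (coverage stratum, citation quartile)."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[(rec["coverage"], rec["quartile"])].append(rec["score"])
    return {key: mean(values) for key, values in buckets.items()}


def top_bottom_gap(means, coverage, top=4, bottom=1):
    """Score gap between top and bottom citation quartiles within one stratum."""
    return means[(coverage, top)] - means[(coverage, bottom)]
```

A gap that stays positive inside each coverage stratum would be consistent with the rebuttal's claim that the advantage is not purely a coverage artifact; a gap that vanishes within strata would point the other way. A regression with covariates, as the rebuttal describes, would be needed to control for several factors at once.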

Circularity Check

0 steps flagged

No circularity: empirical comparison to external bibliographic data

full rationale

The paper performs an empirical evaluation by prompting LLMs to reconstruct coauthor lists and directly comparing outputs against independent bibliographic reference data. The central claim of systematic advantage for highly cited researchers rests on observed differences in reconstruction fidelity across citation strata and regions. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided derivation; the reference data serves as an external benchmark rather than an input that is redefined or predicted from the model outputs themselves. The analysis is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that bibliographic records constitute an objective external benchmark and that differences between LLM output and those records can be attributed to memorization rather than other factors such as query formulation or model decoding choices.

axioms (1)
  • domain assumption: Bibliographic databases provide a complete and unbiased record of actual coauthorships.
    Invoked when the paper treats reference data as ground truth for measuring reconstruction accuracy.

pith-pipeline@v0.9.0 · 5749 in / 1307 out tokens · 26755 ms · 2026-05-18T01:44:59.206696+00:00 · methodology

