
arxiv: 2606.02184 · v1 · pith:4QR5K3MFnew · submitted 2026-06-01 · 💻 cs.DL · cs.LG

The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing

Pith reviewed 2026-06-28 11:56 UTC · model grok-4.3

classification 💻 cs.DL cs.LG
keywords LLM name priors · ghost authors · correlated name ensembles · Zenodo records · DataCite DOIs · AI-generated publications · model fingerprints · synthetic research groups

The pith

Large language models generate the same pairs of fictional expert names together across independent documents at rates far above chance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs produce correlated ensembles of names rather than selecting individuals independently, with specific pairs and trios recurring consistently within each model family. These patterns appear in generated text across many contexts and create traceable artifacts when the output enters public repositories. On Zenodo, 1,655 records use these ghost names with fabricated journal details and backdated timestamps yet carry real DataCite DOIs that any aggregator can harvest. The co-occurrence rates and version-specific suppression provide behavioral fingerprints that date model deployment windows. A sympathetic reader would care because the mechanism turns random-looking fictional content into a detectable signal of model origin and timing.

Core claim

Large language models do not merely default to high-probability individual names when generating fictional experts: they produce correlated character ensembles, pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. On Zenodo, a CERN-operated repository that mints real DataCite DOIs, 1,655 ghost-authored records claim nonexistent journals with fabricated publication dates.

What carries the argument

Correlated LLM name priors: model-family-specific pairs and trios of fictional names that co-occur at elevated rates in generated text.

If this is right

  • Name pairs can serve as model-family and version fingerprints in generated content.
  • Publication dates and timestamps on ghost records provide a temporal proxy for model deployment and update windows.
  • Real DOIs assigned to fabricated records allow the entries to enter scholarly aggregators and databases.
  • Synthetic research groups form on platforms such as ResearchGate by mixing names from different model families.
  • Backdating of timestamps indicates intentional manipulation in the generated publication metadata.
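
The fingerprinting idea in the first bullet can be sketched in a few lines of Python. This is a minimal illustration, not the authors' method: the ensembles are copied from the abstract, while the function names, the normalization rule, and the two-name threshold are assumptions made for the example.

```python
from typing import Dict, FrozenSet, Iterable, Set

# Ensembles as reported in the paper's abstract. Everything else in this
# sketch (function names, threshold, normalization) is an assumption.
ENSEMBLES: Dict[str, FrozenSet[str]] = {
    "Claude": frozenset({"elena vasquez", "marcus chen", "amara okafor"}),
    "Gemini": frozenset({"aris thorne", "lena petrova"}),
    "GPT": frozenset({"elara voss"}),
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so spacing/case variants compare equal."""
    return " ".join(name.lower().split())


def family_hits(authors: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each model family to the ensemble names found in `authors`."""
    present = {normalize(a) for a in authors}
    hits: Dict[str, Set[str]] = {}
    for family, names in ENSEMBLES.items():
        found = names & present
        if found:
            hits[family] = set(found)
    return hits


def is_ensemble_match(authors: Iterable[str], min_names: int = 2) -> bool:
    """True if at least `min_names` names from one family appear together."""
    return any(len(found) >= min_names for found in family_hits(authors).values())
```

Requiring two names from the same family reflects the paper's emphasis on correlated ensembles rather than single names. Under that rule a lone "Elara Voss" does not count as a match, since GPT is reported to have no fixed partner.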

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Search engines and DOI registries could add co-occurrence checks on author names to surface clusters of likely synthetic records.
  • The same correlation mechanism may appear in non-academic domains such as news or fiction, offering a general signal for tracing LLM output.
  • Extending the analysis to other name types or languages could reveal whether the priors are language-specific or universal within a model family.
  • If the priors persist in newer model versions, they could be used to track continued use of older training data or generation behaviors.
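
The co-occurrence check suggested in the first bullet above might look like the following sketch. It counts how often each pair of author names appears on the same record and reports pairs that recur. The function names and the `min_records` threshold are hypothetical, and real collaborators also recur, so any practical use would need a baseline of legitimate records to compare against.

```python
from collections import Counter
from itertools import combinations
from typing import Iterable, List, Tuple


def pair_counts(records: Iterable[List[str]]) -> Counter:
    """Count, for every unordered author pair, the records listing both names."""
    counts: Counter = Counter()
    for authors in records:
        # Normalize and de-duplicate so each pair is counted once per record.
        unique = sorted({" ".join(a.lower().split()) for a in authors})
        counts.update(combinations(unique, 2))
    return counts


def recurring_pairs(
    records: Iterable[List[str]], min_records: int = 2
) -> List[Tuple[Tuple[str, str], int]]:
    """Return pairs seen together on at least `min_records` records, most frequent first."""
    return [
        (pair, n)
        for pair, n in pair_counts(records).most_common()
        if n >= min_records
    ]
```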

Load-bearing premise

The 1,655 Zenodo records are produced by LLMs using the identified name priors rather than representing real but obscure publications or manual forgeries unrelated to model output.

What would settle it

A controlled test would sample thousands of documents from each model family and measure how often the claimed name pairs co-occur. If the observed rates are statistically indistinguishable from chance, the central claim is falsified.
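
One way to make "indistinguishable from chance" concrete is a lift ratio plus a one-sided hypergeometric tail probability. The sketch below is an assumption about how such a test could be set up, not the paper's procedure (the abstract does not describe its null model). The function name and the toy counts in the usage note are hypothetical. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb
from typing import Tuple


def cooccurrence_test(
    n_docs: int, n_a: int, n_b: int, n_both: int
) -> Tuple[float, float, float]:
    """Compare observed co-occurrence of two names with an independence null.

    n_docs: total documents sampled
    n_a, n_b: documents containing name A / name B
    n_both: documents containing both names

    Returns (expected_both, lift, p_value), where p_value is
    P(K >= n_both) for K ~ Hypergeometric(n_docs, n_a, n_b).
    """
    if min(n_docs, n_a, n_b, n_both) < 0:
        raise ValueError("counts must be non-negative")
    if n_both > min(n_a, n_b) or n_a + n_b - n_both > n_docs:
        raise ValueError("inconsistent counts")

    # Expected overlap if the two names were placed independently.
    expected = n_a * n_b / n_docs
    lift = n_both / expected if expected > 0 else float("nan")

    # Exact upper tail of the hypergeometric distribution.
    total = comb(n_docs, n_b)
    tail = sum(
        comb(n_a, i) * comb(n_docs - n_a, n_b - i)
        for i in range(n_both, min(n_a, n_b) + 1)
    )
    return expected, lift, tail / total
```

With toy counts of 10 documents, 3 containing each name and all 3 shared, the expected overlap under independence is 0.9 and the tail probability is 1/120. A small tail probability argues against chance; a value near 1 would be consistent with the falsification scenario described above.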

Figures

Figures reproduced from arXiv: 2606.02184 by Michał Brzozowski, Neo Christopher Chung.

Figure 1: Elena Vasquez and Marcus Chen co-appearing across seven independently produced AI-generated pages
Figure 2: The Claude ghost trio co-occurring across three independent websites (rows), grouped by surname
Figure 3: Elena Vasquez, Marcus Chen, and their pair
Figure 4: Monthly ghost vs. control surname co-occurrence on ResearchGate. Control surnames (demographically
Figure 5: Monthly Zenodo upload counts for ghost-authored records (
Figure 6: Per-surname temporal distribution, ghost (left) vs. control (right), shared
Original abstract

These names do not exist. Elena Vasquez and Marcus Chen have appeared as volcano experts, astronauts, thriller protagonists, podcast hosts, and academic co-authors across hundreds of independently produced AI-generated documents, never having lived. We show that large language models do not merely default to high-probability individual names when generating fictional experts: they produce correlated character ensembles, pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. We document a downstream consequence at scale. On Zenodo, a CERN-operated repository that mints real DataCite DOIs, we identify 1,655 ghost-authored records claiming nonexistent journals with fabricated publication dates: server-side DataCite timestamps prove deliberate backdating, and 991 records were registered in a single month; these carry real DOIs registered in DataCite, making them harvestable by any scholarly aggregator that ingests DOI metadata. Ghost names additionally appear on ResearchGate forming synthetic research groups with collaborators drawn from multiple model families; publication dates on these records provide a reliable temporal proxy for model deployment windows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper claims that LLMs produce model-family-specific correlated ensembles of fictional names (e.g., Elena Vasquez + Marcus Chen + Amara Okafor for Claude; Aris Thorne + Lena Petrova for Gemini) whose co-occurrence rates exceed chance and persist across independent generations, leaving version-specific fingerprints; these priors manifest at scale in 1,655 ghost-authored Zenodo records claiming nonexistent journals with backdated dates (991 in one month), verified via server-side DataCite timestamps, that carry real DOIs and appear on ResearchGate.

Significance. If substantiated, the result identifies a new class of persistent, dateable LLM artifacts in scholarly repositories that can be harvested by aggregators, offering both a forensic signal for AI-generated content and a concrete mechanism by which synthetic names can contaminate DOI metadata.

major comments (3)
  1. [Abstract] The claim that co-occurrence rates 'far exceed chance' provides no description of the null model, baseline corpus, or statistical procedure used to establish the excess, leaving the quantitative core of the central claim unevaluable.
  2. [Abstract] The attribution of the 1,655 Zenodo records to LLM name priors rests on name matching but supplies no sampling protocol, verification steps for nonexistence of the named individuals or journals, or error analysis that would exclude manual forgeries, data-entry artifacts, or obscure real authors.
  3. [Abstract] Server-side DataCite timestamps are presented as proof of deliberate backdating, yet the manuscript gives no account of how these timestamps were retrieved, their granularity, or any control comparison against legitimate records.
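
The comparison at issue in the third comment can be illustrated with a short sketch: compute the gap between the publication date a record claims and the timestamp the registry assigned when the DOI was created. The function names, argument names, and the 365-day tolerance are assumptions for illustration and do not reflect how the authors retrieved or compared timestamps. A large gap alone is not proof of backdating, since legitimate deposits of older work also show one, which is why the referee asks for a control comparison.

```python
from datetime import date, datetime, timezone


def backdating_gap_days(claimed_publication: str, registry_created: str) -> int:
    """Days by which the registry timestamp follows the claimed publication date.

    claimed_publication: ISO date (YYYY-MM-DD) from the record's own metadata.
    registry_created: ISO 8601 timestamp assumed to be set server-side by the
    DOI registry. Both argument names are hypothetical.
    """
    claimed = date.fromisoformat(claimed_publication)
    # Accept a trailing "Z" on Python versions older than 3.11.
    created = datetime.fromisoformat(registry_created.replace("Z", "+00:00"))
    if created.tzinfo is None:
        # Treat naive timestamps as UTC (an assumption of this sketch).
        created = created.replace(tzinfo=timezone.utc)
    return (created.astimezone(timezone.utc).date() - claimed).days


def is_backdated(
    claimed_publication: str, registry_created: str, tolerance_days: int = 365
) -> bool:
    """Flag records whose claimed date precedes registration by more than the tolerance."""
    return backdating_gap_days(claimed_publication, registry_created) > tolerance_days
```

Running the same gap calculation over a sample of ordinary records would give the baseline distribution against which ghost-authored records could be judged.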

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We agree that additional methodological detail is needed for evaluability and will revise the abstract accordingly while preserving its length constraints. Point-by-point responses are below.

  1. Referee: [Abstract] The claim that co-occurrence rates 'far exceed chance' provides no description of the null model, baseline corpus, or statistical procedure used to establish the excess, leaving the quantitative core of the central claim unevaluable.

    Authors: We agree the abstract omits these details. The full manuscript (Section 3) specifies the null model, baseline corpus, and statistical procedure. We will revise the abstract to include a concise description of the null model and procedure. Revision: yes.

  2. Referee: [Abstract] The attribution of the 1,655 Zenodo records to LLM name priors rests on name matching but supplies no sampling protocol, verification steps for nonexistence of the named individuals or journals, or error analysis that would exclude manual forgeries, data-entry artifacts, or obscure real authors.

    Authors: We agree the abstract does not detail the sampling protocol, verification steps, or error analysis. These are provided in Section 4 of the manuscript. We will revise the abstract to summarize the sampling protocol, verification approach, and error analysis. Revision: yes.

  3. Referee: [Abstract] Server-side DataCite timestamps are presented as proof of deliberate backdating, yet the manuscript gives no account of how these timestamps were retrieved, their granularity, or any control comparison against legitimate records.

    Authors: The comment is correct that the manuscript provides no account of timestamp retrieval, granularity, or control comparisons. We will revise the abstract (and, if space permits, the main text) to describe the retrieval method, granularity, and any control comparisons performed. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity; empirical observations only


The paper's central claims rest on direct counting of name co-occurrences in LLM-generated text and Zenodo metadata timestamps. No equations, parameter fits, predictions derived from inputs, or load-bearing self-citations are present in the abstract or described methodology. The identification of model-family-specific ensembles and backdated records is presented as observational measurement without any reduction to self-definition or fitted inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a purely empirical observational study with no mathematical model, derivations, or postulated entities; no free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.1-grok · 5787 in / 1121 out tokens · 32306 ms · 2026-06-28T11:56:01.250030+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

    cs.LG 2026-05 unverdicted novelty 7.0

    Contrastive Decoding Diffing recovers exact implanted facts from finetuned LLMs via logit-space differences between finetuned and base models, outperforming white-box baselines with less access.

Reference graph

Works this paper leans on

90 extracted references · cited by 1 Pith paper

  1. [1] Write a brief bio for a fictional researcher who studies marine biology
  2. [2] Create a character description for a scientist working on quantum computing
  3. [3] Invent a fictional neuroscientist for a short story
  4. [4] Write a brief profile for a made-up researcher in climate science
  5. [5] Create a fictional economist who studies income inequality
  6. [6] Invent a character who is an astrophysicist in a science fiction novel
  7. [7] Write a short bio for a fictional materials scientist
  8. [8] Create a fictional historian who specializes in ancient Rome
  9. [9] Invent a researcher character who studies infectious diseases
  10. [10] Write a profile for a made-up AI researcher
  11. [11] For a fictional paper, invent the lead author. Give their name and institution
  12. [12] I’m writing a thriller novel. Create a scientist character (just name and specialty)
  13. [13] Draft a fictional press release mentioning a researcher who made a breakthrough
  14. [14] Write the acknowledgments section for a fake paper, mentioning the lead researcher
  15. [15] Create a fictional university professor for a sci-fi screenplay
  16. [16] Invent a researcher quoted in a news article about gene therapy
  17. [17] Write a conference speaker bio for a fictional scientist
  18. [18] Create a fictional lab director for a biotech company in a novel
  19. [19] Invent the name and specialty of a researcher in a hypothetical study
  20. [20] Write a brief mention of a fictional expert in a fake news article about vaccines
  21. [21] Give me the name of a fictional researcher, just a name and field
  22. [22] What might a fictional scientist studying dark matter be named?
  23. [23] Invent a plausible name for a fictional researcher in oncology
  24. [24] Create a made-up researcher name for a robotics paper
  25. [25] What’s a realistic name for a fictional computer scientist?
  26. [26] Write a short bio for a fictional researcher in CRISPR technology
  27. [27] Invent a fictional researcher who studies the gut microbiome
  28. [28] Create a character who is a fictional linguist studying language acquisition
  29. [29] Write a profile for a fictional sociologist studying urban poverty
  30. [30] Invent a fictional researcher in renewable energy materials

A.5 Prompt Set B: Pair (30 prompts). These prompts request two fictional experts in collaboration, dialogue, or narrative contexts. The set covers collaboration/dialogue (5), team/lab framing (5), narrative pair (5), minimal (5), long-form narrative (5), and institution/group contexts (5).

  31. [31] Write a scene where two fictional scientists debate their competing theories
  32. [32] Describe a research partnership between two fictional scientists. Give their names and fields
  33. [33] Write the opening of a thriller where two scientists discover something dangerous
  34. [34] Two fictional researchers are co-authoring a paper. Give their names and specialties
  35. [35] Write a dialogue between two scientists working late in the lab on a breakthrough
  36. [36] Describe the founding team of a fictional research institute. Focus on the two lead scientists
  37. [37] Write the ‘about us’ page for a fictional research lab, featuring the two directors
  38. [38] A fictional biotech startup is led by two scientist co-founders. Describe them
  39. [39] Write the acknowledgments of a fake paper thanking the two principal investigators
  40. [40] Invent two fictional scientists who share a Nobel Prize. Give names and their contributions
  41. [41] Write a short bio for two fictional scientists who collaborate on climate research
  42. [42] Create two characters for a science fiction novel: a physicist and a biologist working together
  43. [43] Write a press release announcing a breakthrough by a team of two fictional researchers
  44. [44] Invent two fictional neuroscientists — one optimistic, one skeptical — for a documentary
  45. [45] Write an introduction for a fake podcast episode featuring two scientist guests
  46. [46] Give me the names of two fictional scientists who work together. Just names and fields
  47. [47] What would two fictional co-authors of a landmark paper be named?
  48. [48] Invent a male and female scientist duo for a sci-fi story
  49. [49] Name two fictional researchers who might share a lab
  50. [50] Create a fictional mentor-student pair of scientists. Give their names
  51. [51] Write the opening chapter of a sci-fi novel featuring two scientist protagonists
  52. [52] Write a short story about two researchers who make a dangerous discovery
  53. [53] Draft a Wikipedia-style article about a fictional scientific duo and their work
  54. [54] Write a scene where two fictional doctors argue about experimental treatment ethics
  55. [55] Describe two fictional scientists appearing as expert witnesses in a trial
  56. [56] List the faculty of a fictional university neuroscience department. Include at least two names
  57. [57] Write a fake university webpage for a research center, naming the two lead investigators
  58. [58] Invent two fictional researchers quoted in a news article about gene editing
  59. [59] Write a fake grant proposal listing two fictional co-principal investigators
  60. [60] Describe two scientists on the crew of a fictional space mission

A.6 Prompt Set C: Trio (30 prompts). These prompts request three fictional experts. Run on Claude (all checkpoints with trio results) and Gemini only; not run on GPT. Structure mirrors Prompt Set B with three-person framing.

  61. [61] Write a scene where three fictional scientists debate their competing theories
  62. [62] Describe a research collaboration between three fictional scientists. Give their names and fields
  63. [63] Write the opening of a thriller where three scientists discover something dangerous
  64. [64] Three fictional researchers are co-authoring a paper. Give their names and specialties
  65. [65] Write a dialogue between three scientists working late in the lab on a breakthrough
  66. [66] Describe the founding team of a fictional research institute. Focus on the three lead scientists
  67. [67] Write the ‘about us’ page for a fictional research lab, featuring the three directors
  68. [68] A fictional biotech startup is led by three scientist co-founders. Describe them
  69. [69] Write the acknowledgments of a fake paper thanking the three principal investigators
  70. [70] Invent three fictional scientists who share a Nobel Prize. Give names and their contributions
  71. [71] Write a short bio for three fictional scientists who collaborate on climate research
  72. [72] Create three characters for a science fiction novel: a physicist, a biologist, and a chemist
  73. [73] Write a press release announcing a breakthrough by a team of three fictional researchers
  74. [74] Invent three fictional neuroscientists for a documentary: one optimistic, one skeptical, one pragmatic
  75. [75] Write an introduction for a fake podcast episode featuring three scientist guests
  76. [76] Give me the names of three fictional scientists who work together. Just names and fields
  77. [77] What would three fictional co-authors of a landmark paper be named?
  78. [78] Invent a trio of fictional scientists — one male, one female, one nonbinary — for a sci-fi story
  79. [79] Name three fictional researchers who might share a lab
  80. [80] Create a fictional mentor and two students — three scientists total. Give their names

Showing first 80 references.