
arxiv: 2604.24576 · v1 · submitted 2026-04-21 · 💻 cs.CY

BuyTheBy: A dataset of 18,710 text-based paper mill advertisements with 51,812 timestamped prices

Pith reviewed 2026-05-10 00:46 UTC · model grok-4.3

classification 💻 cs.CY
keywords paper mills · academic fraud · research integrity · dataset · price data · advertisements · scientific publishing · fraud services

The pith

A dataset of 18,710 timestamped paper mill advertisements with 51,812 prices is now available for studying academic fraud markets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compiles BuyTheBy, a dataset of 18,710 text-based advertisements from seven paper mill businesses operating in seven countries. It includes 15,839 ads with prices, listing 20,598 positions across 5,567 unique products in 14 categories and providing 51,812 timestamped price points. This fills a gap in data that has hindered quantitative research on markets for services like fake papers and degrees. Sympathetic readers would value it because it allows tracking of prices, products, and market trends in academic fraud. The authors include basic analysis to show how the data can be used and propose additional applications.

Core claim

Here we assemble BuyTheBy, a large, annotated dataset of timestamped, text-based paper mill advertisements from seven businesses operating out of seven different countries. The dataset consists of 18,710 individual advertisements, of which 15,839 have prices listed. Among these there are 20,598 positions listed as for sale on 5,567 unique products in 14 different product categories with 51,812 timestamped price data points. We perform elementary analysis of this dataset to demonstrate its utility for quantitative understanding of markets for academic fraud services and suggest future use cases.

What carries the argument

The BuyTheBy dataset, which aggregates and annotates text-based advertisements and their associated prices from paper mill operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dataset could help estimate the total revenue generated by paper mills by combining price data with estimated sales volumes.
  • Cross-matching the advertised products with real publications might quantify the prevalence of fraud in the scientific literature.
  • Future work could test whether price trends correlate with changes in academic policies or detection technologies.
  • Law enforcement or publishers might use the ad language patterns to proactively identify new paper mill operations.
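
The first extension above is simple arithmetic: listed prices weighted by how many positions are assumed to have sold. The sketch below is illustrative only. The function name, the `sold_fraction` parameter, and the prices are hypothetical; none of them come from the paper or the dataset.

```python
def estimated_revenue(listed_prices_usd, sold_fraction):
    """Total of listed prices scaled by an assumed share of positions sold."""
    if not 0.0 <= sold_fraction <= 1.0:
        raise ValueError("sold_fraction must be between 0 and 1")
    return sum(price * sold_fraction for price in listed_prices_usd)


# Hypothetical listed prices in USD; not taken from BuyTheBy.
prices = [300.0, 180.0, 360.0]

# Report a range, because the share of positions actually sold is unknown.
low = estimated_revenue(prices, 0.25)
high = estimated_revenue(prices, 1.0)
print(f"Estimated revenue between {low:.0f} and {high:.0f} USD")
```

Reporting a range rather than a point estimate reflects the fact that the advertisements give listed prices only, with no record of sales.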

Load-bearing premise

The collected advertisements and listed prices accurately reflect the actual market offerings and transaction prices rather than being fabricated or inflated listings.

What would settle it

Direct evidence that the prices listed in the advertisements do not match the amounts actually paid by customers for the described services would undermine the dataset's value for understanding real market conditions.

Figures

Figures reproduced from arXiv: 2604.24576 by Anna Abalkina, Reese AK Richardson, Spencer S Hong.

Figure 1: The median listed price for sole authorship, first authorship and sole authorship of a single chapter of an “international” textbook changed over time in advertisements posted by B1. Prices were converted from INR to USD at a static exchange rate of 0.012 USD to 1 INR.
Figure 2: The median listed price for authorship positions on IEEE conference proceeding articles changed over time in advertisements posted by B1. Prices were converted from INR to USD at a static exchange rate of 0.012 USD to 1 INR. This dataset also allows for comparison of prices for similar products in different markets, as demonstrated for authorship positions on academic articles among the seven businesses in…
Figure 3: Distributions of prices for authorship positions on academic articles advertised by each business. Only the most recently-advertised price for a given authorship position is included and only authorship slots one through five are shown. Boxplots show median as a horizontal line, interquartile range as boxes, 2.5th and 97.5th percentiles as whiskers, and outliers as diamonds. Prices were converted to USD ba…
Figure 4: An example advertisement from B1 (posted on 26 June 2024, id tag “message3806”).
Figure 5: An example advertisement from B2 (posted on 19 March 2024, id tag “message23”).
Figure 6: An example advertisement from B3 (posted on 21 March 2025, id tag “message7180”).
Figure 7: An example advertisement from B4 (archived on 11 November 2023, id tag “231111 176”).
Figure 8: An example advertisement from B5’s “publication service agreement” pages (archived on 1 November 2021, id tag “1439.3”).
Figure 9: An example advertisement from B5’s new website pages (archived on 7 August 2025, id tag “4063 250807”).
Figure 10: An example advertisement from B6 (archived on 21 March 2026, id tag “260321 Педагогика/Образование 43”).
Figure 11: An example advertisement from B7 (archived on 21 March 2026, id tag “260321 Economic Sciences 5”).
Figure 12: Distributions of prices for authorship positions on academic articles advertised by B1. Only the most recently-advertised price for a given authorship position is included. Boxplots show median as a horizontal line, interquartile range as boxes, 2.5th and 97.5th percentiles as whiskers, and outliers as diamonds.
Figure 13: Distributions of prices for authorship positions on academic articles advertised by B2. Only the most recently-advertised price for a given authorship position is included. Boxplots show median as a horizontal line, interquartile range as boxes, 2.5th and 97.5th percentiles as whiskers, and outliers as diamonds.
Figure 14: Distributions of prices for authorship positions on academic articles advertised by B3. Only the most recently-advertised price for a given authorship position is included. Boxplots show median as a horizontal line, interquartile range as boxes, 2.5th and 97.5th percentiles as whiskers, and outliers as diamonds.
Figure 15: Distributions of prices for authorship positions on academic articles advertised by B4. Only the most recently-advertised price for a given authorship position is included. Boxplots show median as a horizontal line, interquartile range as boxes, 2.5th and 97.5th percentiles as whiskers, and outliers as diamonds.
Figure 16: Distributions of prices for authorship positions on academic articles advertised by B5. Only the most recently-advertised price for a given authorship position is included. Boxplots show median as a horizontal line, interquartile range as boxes, 2.5th and 97.5th percentiles as whiskers, and outliers as diamonds.
Figure 17: Distributions of prices for authorship positions on academic articles advertised by B6. Only the most recently-advertised price for a given authorship position is included. Boxplots show median as a horizontal line, interquartile range as boxes, 2.5th and 97.5th percentiles as whiskers, and outliers as diamonds.
Figure 18: Distributions of prices for authorship positions on academic articles advertised by B7. Only the most recently-advertised price for a given authorship position is included. Boxplots show median as a horizontal line, interquartile range as boxes, 2.5th and 97.5th percentiles as whiskers, and outliers as diamonds.
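
The figure captions describe three small computations: converting INR to USD at a static rate of 0.012, keeping only the most recently advertised price for each authorship position, and summarising the result with a median, an interquartile range, and 2.5th/97.5th percentile whiskers. The following is a minimal sketch of those steps, not the authors' code. The row layout, the function names, and the linear-interpolation percentile rule are assumptions, and the example rows are invented.

```python
from datetime import date
from statistics import median

INR_TO_USD = 0.012  # static rate stated in the Figure 1 and Figure 2 captions


def percentile(values, q):
    """Linear-interpolation percentile for q in [0, 100] (assumed rule)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values")
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def latest_prices(rows):
    """Keep only the most recent price per (product, position) pair."""
    latest = {}
    for product, position, day, price in rows:
        key = (product, position)
        if key not in latest or day > latest[key][0]:
            latest[key] = (day, price)
    return [price for _, price in latest.values()]


def box_summary(prices):
    """Statistics the captions say the boxplots display."""
    return {
        "median": median(prices),
        "q1": percentile(prices, 25),
        "q3": percentile(prices, 75),
        "whisker_low": percentile(prices, 2.5),
        "whisker_high": percentile(prices, 97.5),
    }


# Hypothetical rows: (product id, authorship position, date, price in INR).
rows = [
    ("p1", 1, date(2024, 1, 5), 20000),
    ("p1", 1, date(2024, 6, 26), 25000),
    ("p1", 2, date(2024, 6, 26), 15000),
    ("p2", 1, date(2024, 3, 19), 30000),
]

usd = [inr * INR_TO_USD for inr in latest_prices(rows)]
summary = box_summary(usd)
print(summary)
```

Percentile conventions differ between plotting libraries, so whisker positions computed this way may not match the published figures exactly.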
Original abstract

The study of paper mills and similar businesses operating in the market for academic and education fraud services is frustrated by the lack of market price data on their various offerings. Here, we assemble BuyTheBy, a large, annotated dataset of timestamped, text-based paper mill advertisements from seven businesses operating out of seven different countries. The dataset consists of 18,710 individual advertisements, of which 15,839 have prices listed. Among these there are 20,598 positions listed as for sale on 5,567 unique products in 14 different product categories with 51,812 timestamped price data points. We perform elementary analysis of this dataset to demonstrate its utility for quantitative understanding of markets for academic fraud services and suggest future use cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper assembles and releases BuyTheBy, a dataset of 18,710 annotated text-based advertisements from seven paper-mill businesses across seven countries. It reports 15,839 ads with prices, 20,598 positions on 5,567 unique products in 14 categories, and 51,812 timestamped price points, accompanied by elementary analysis intended to illustrate the dataset's utility for quantitative study of markets for academic fraud services.

Significance. If the collection and extraction procedures are sound and the prices reflect genuine market activity, the dataset supplies the first large-scale, timestamped price series for paper-mill offerings. This directly addresses the acknowledged scarcity of quantitative data in the field and could support analyses of pricing dynamics, product differentiation, temporal trends, and cross-country differences. The scale and public release constitute a concrete contribution even if downstream modeling remains elementary.

major comments (2)
  1. [Abstract and data-assembly description] The manuscript provides no description of the scraping protocol, source identification, deduplication rules, or handling of missing/incomplete advertisements. Because the central claim is the assembly of a usable, representative dataset, the absence of these methodological details prevents independent assessment of completeness and selection bias (see abstract and the section describing dataset construction).
  2. [Price extraction and elementary analysis] No verification step is reported that would confirm whether the extracted prices correspond to actual transactions rather than advertised or fabricated figures. The elementary analysis therefore rests on an untested assumption that listed prices are reliable market signals; this directly affects the claimed quantitative utility (see the skeptic note on transaction verification and the analysis section).
minor comments (2)
  1. [Dataset statistics] Clarify the exact definition of a 'unique product' and how the 14 product categories were derived; the current counts (5,567 unique products, 20,598 positions) are difficult to interpret without this mapping.
  2. [Data availability] The abstract and main text should explicitly state whether the full dataset (including raw text and timestamps) will be released under an open license and provide a persistent identifier or repository link.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive comments on the BuyTheBy dataset paper. We address the two major comments point by point below, indicating where revisions will be made.

Point-by-point responses
  1. Referee: [Abstract and data-assembly description] The manuscript provides no description of the scraping protocol, source identification, deduplication rules, or handling of missing/incomplete advertisements. Because the central claim is the assembly of a usable, representative dataset, the absence of these methodological details prevents independent assessment of completeness and selection bias (see abstract and the section describing dataset construction).

    Authors: We agree that the manuscript would be improved by greater transparency on the data assembly process. In the revised version, we will expand the dataset construction section with a clear description of the scraping protocol, source identification methods, deduplication rules, and handling of incomplete advertisements. This addition will allow readers to better evaluate potential selection biases and the dataset's representativeness. revision: yes

  2. Referee: [Price extraction and elementary analysis] No verification step is reported that would confirm whether the extracted prices correspond to actual transactions rather than advertised or fabricated figures. The elementary analysis therefore rests on an untested assumption that listed prices are reliable market signals; this directly affects the claimed quantitative utility (see the skeptic note on transaction verification and the analysis section).

    Authors: The BuyTheBy dataset is a collection of text-based advertisements, and the prices are the listed (advertised) prices from those ads rather than verified transaction prices. We do not and cannot claim that these prices reflect completed sales, as confirming actual transactions would require private records unavailable from public advertisements. The elementary analysis examines trends and patterns in the advertised prices and offerings, which is informative for understanding market signals in this domain. We will revise the manuscript to explicitly clarify this distinction and discuss the associated limitations. revision: partial

standing simulated objections not resolved
  • Independent verification that listed prices correspond to actual transactions is not possible from the available public advertisement data, as it would require access to proprietary transaction records from the paper mill operators.

Circularity Check

0 steps flagged

No circularity in direct data-release paper

Full rationale

This paper assembles and releases the BuyTheBy dataset of scraped paper-mill advertisements and prices with no derivations, predictions, fitted models, or first-principles claims. The central contribution is data collection and annotation from public text sources, followed only by elementary descriptive analysis to illustrate utility. No load-bearing step reduces to self-definition, fitted inputs renamed as predictions, or self-citation chains; the work is self-contained as a data resource.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a descriptive dataset paper with no mathematical derivations, fitted parameters, background axioms, or postulated entities.


