BuyTheBy: A dataset of 18,710 text-based paper mill advertisements with 51,812 timestamped prices
Pith reviewed 2026-05-10 00:46 UTC · model grok-4.3
The pith
A dataset of 18,710 timestamped paper mill advertisements with 51,812 prices is now available for studying academic fraud markets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Here we assemble BuyTheBy, a large, annotated dataset of timestamped, text-based paper mill advertisements from seven businesses operating out of seven different countries. The dataset consists of 18,710 individual advertisements, of which 15,839 have prices listed. Among these there are 20,598 positions listed as for sale on 5,567 unique products in 14 different product categories with 51,812 timestamped price data points. We perform elementary analysis of this dataset to demonstrate its utility for quantitative understanding of markets for academic fraud services and suggest future use cases.
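The headline counts can be turned into a few derived ratios that make the scale easier to read. The sketch below uses only the numbers stated in the claim above; the ratio names are illustrative labels chosen here, and none of these ratios is reported by the paper itself.

```python
# Counts exactly as stated in the abstract.
ADS_TOTAL = 18_710
ADS_WITH_PRICE = 15_839
POSITIONS = 20_598
UNIQUE_PRODUCTS = 5_567
PRICE_POINTS = 51_812


def summarize() -> dict[str, float]:
    """Return simple ratios derived from the stated counts (illustrative only)."""
    return {
        "share_of_ads_with_price": ADS_WITH_PRICE / ADS_TOTAL,
        "positions_per_product": POSITIONS / UNIQUE_PRODUCTS,
        "price_points_per_product": PRICE_POINTS / UNIQUE_PRODUCTS,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

Read this way, roughly 84.7% of advertisements list a price, with about 3.7 positions and about 9.3 price points per unique product on average.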
What carries the argument
The BuyTheBy dataset, which aggregates and annotates text-based advertisements and their associated prices from paper mill operations.
Where Pith is reading between the lines
- The dataset could help estimate the total revenue generated by paper mills by combining price data with estimated sales volumes.
- Cross-matching the advertised products with real publications might quantify the prevalence of fraud in the scientific literature.
- Future work could test whether price trends correlate with changes in academic policies or detection technologies.
- Law enforcement or publishers might use the ad language patterns to proactively identify new paper mill operations.
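The revenue idea in the first bullet amounts to multiplying advertised prices by sales volumes and summing. The following is a minimal sketch under strong assumptions: the `Listing` type, the field names, and all numbers are hypothetical, and the sales-volume estimates would have to come from a source outside BuyTheBy, which contains advertised prices only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Listing:
    product_id: str
    price_usd: float       # advertised price (hypothetical value)
    est_units_sold: float  # external estimate, NOT part of BuyTheBy


def estimated_revenue(listings: list[Listing]) -> float:
    """Sum of advertised price times estimated units sold."""
    return sum(item.price_usd * item.est_units_sold for item in listings)


# Made-up records for illustration only.
demo = [
    Listing("p1", 500.0, 2),
    Listing("p2", 1200.0, 1),
    Listing("p3", 300.0, 0),  # advertised but assumed never sold
]

if __name__ == "__main__":
    print(estimated_revenue(demo))
```

Because the prices are listed rather than paid, any figure produced this way is better read as an estimate of advertised value than of actual revenue.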
Load-bearing premise
The collected advertisements and listed prices accurately reflect the actual market offerings and transaction prices rather than being fabricated or inflated listings.
What would settle it
Direct evidence that the prices listed in the advertisements do not match the amounts actually paid by customers for the described services would undermine the dataset's value for understanding real market conditions.
Original abstract
The study of paper mills and similar businesses operating in the market for academic and education fraud services is frustrated by the lack of market price data on their various offerings. Here, we assemble BuyTheBy, a large, annotated dataset of timestamped, text-based paper mill advertisements from seven businesses operating out of seven different countries. The dataset consists of 18,710 individual advertisements, of which 15,839 have prices listed. Among these there are 20,598 positions listed as for sale on 5,567 unique products in 14 different product categories with 51,812 timestamped price data points. We perform elementary analysis of this dataset to demonstrate its utility for quantitative understanding of markets for academic fraud services and suggest future use cases.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper assembles and releases BuyTheBy, a dataset of 18,710 annotated text-based advertisements from seven paper-mill businesses across seven countries. It reports 15,839 ads with prices, 20,598 positions on 5,567 unique products in 14 categories, and 51,812 timestamped price points, accompanied by elementary analysis intended to illustrate the dataset's utility for quantitative study of markets for academic fraud services.
Significance. If the collection and extraction procedures are sound and the prices reflect genuine market activity, the dataset supplies the first large-scale, timestamped price series for paper-mill offerings. This directly addresses the acknowledged scarcity of quantitative data in the field and could support analyses of pricing dynamics, product differentiation, temporal trends, and cross-country differences. The scale and public release constitute a concrete contribution even if downstream modeling remains elementary.
Major comments (2)
- [Abstract and data-assembly description] The manuscript provides no description of the scraping protocol, source identification, deduplication rules, or handling of missing/incomplete advertisements. Because the central claim is the assembly of a usable, representative dataset, the absence of these methodological details prevents independent assessment of completeness and selection bias (see abstract and the section describing dataset construction).
- [Price extraction and elementary analysis] No verification step is reported that would confirm whether the extracted prices correspond to actual transactions rather than advertised or fabricated figures. The elementary analysis therefore rests on an untested assumption that listed prices are reliable market signals; this directly affects the claimed quantitative utility (see the skeptic note on transaction verification and the analysis section).
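To make the first major comment concrete, a deduplication rule of the kind the referee asks the authors to document could look like the sketch below. It is not the authors' method, which this page does not describe: the normalization steps, the `(business, posted_at, text)` tuple layout, and the choice to keep the earliest copy are all assumptions made for illustration.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash of the ad text after collapsing whitespace and case-folding."""
    normalized = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(ads):
    """Keep the earliest ad per (business, fingerprint).

    `ads` is an iterable of (business, posted_at, text) tuples, where
    posted_at is an ISO 8601 date string so that string order is date order.
    """
    seen = set()
    unique = []
    for business, posted_at, text in sorted(ads, key=lambda ad: ad[1]):
        key = (business, fingerprint(text))
        if key not in seen:
            seen.add(key)
            unique.append((business, posted_at, text))
    return unique
```

A rule like this would need care in a price dataset: a reposted advertisement with identical text may still be a legitimate new timestamped observation, so whether to drop or retain reposts is exactly the kind of decision the methods section should state.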
Minor comments (2)
- [Dataset statistics] Clarify the exact definition of a 'unique product' and how the 14 product categories were derived; the current counts (5,567 unique products, 20,598 positions) are difficult to interpret without this mapping.
- [Data availability] The abstract and main text should explicitly state whether the full dataset (including raw text and timestamps) will be released under an open license and provide a persistent identifier or repository link.
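One way to make the counts in the first minor comment interpretable is to state the mapping as an explicit record schema. The sketch below is a guess at such a schema, not the dataset's actual format: it assumes a "position" is a purchasable slot on a product and that each price data point ties one advertisement, product, and position to a date. All field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PricePoint:
    ad_id: str         # advertisement the price was read from
    product_id: str    # assumed identifier of a unique product
    position: str      # assumed label of the slot for sale on that product
    observed_on: date  # timestamp of the advertisement
    price: float
    currency: str


def count_summary(points: list[PricePoint]) -> dict[str, int]:
    """Count price points, unique products, and unique (product, position) pairs."""
    return {
        "price_points": len(points),
        "unique_products": len({p.product_id for p in points}),
        "positions": len({(p.product_id, p.position) for p in points}),
    }
```

Under these assumed definitions, the same position re-advertised at a later date adds a price point without adding a position or a product, which would be one possible explanation for why the three reported counts differ.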
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the BuyTheBy dataset paper. We address the two major comments point by point below, indicating where revisions will be made.
- Referee: [Abstract and data-assembly description] The manuscript provides no description of the scraping protocol, source identification, deduplication rules, or handling of missing/incomplete advertisements. Because the central claim is the assembly of a usable, representative dataset, the absence of these methodological details prevents independent assessment of completeness and selection bias (see abstract and the section describing dataset construction).
  Authors: We agree that the manuscript would be improved by greater transparency on the data assembly process. In the revised version, we will expand the dataset construction section with a clear description of the scraping protocol, source identification methods, deduplication rules, and handling of incomplete advertisements. This addition will allow readers to better evaluate potential selection biases and the dataset's representativeness.
  Revision: yes
- Referee: [Price extraction and elementary analysis] No verification step is reported that would confirm whether the extracted prices correspond to actual transactions rather than advertised or fabricated figures. The elementary analysis therefore rests on an untested assumption that listed prices are reliable market signals; this directly affects the claimed quantitative utility (see the skeptic note on transaction verification and the analysis section).
  Authors: The BuyTheBy dataset is a collection of text-based advertisements, and the prices are the listed (advertised) prices from those ads rather than verified transaction prices. We do not and cannot claim that these prices reflect completed sales, as confirming actual transactions would require private records unavailable from public advertisements. The elementary analysis examines trends and patterns in the advertised prices and offerings, which is informative for understanding market signals in this domain. We will revise the manuscript to explicitly clarify this distinction and discuss the associated limitations.
  Revision: partial
- Independent verification that listed prices correspond to actual transactions is not possible from the available public advertisement data, as it would require access to proprietary transaction records from the paper mill operators.
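The kind of trend analysis the authors describe, run on advertised rather than transacted prices, can be illustrated with a monthly median. This is a minimal sketch with made-up data: the `(date, price)` tuple layout is an assumption, prices are taken to be already converted to one currency, and the median is a choice made here for robustness to outliers, not a statistic the paper is known to use.

```python
from collections import defaultdict
from datetime import date
from statistics import median


def monthly_median(points):
    """Median advertised price per (year, month), in chronological order.

    `points` is an iterable of (observed_on, price) tuples with prices
    already expressed in a single currency.
    """
    buckets = defaultdict(list)
    for observed_on, price in points:
        buckets[(observed_on.year, observed_on.month)].append(price)
    return {month: median(prices) for month, prices in sorted(buckets.items())}


# Made-up observations for illustration only.
demo = [
    (date(2024, 3, 19), 400.0),
    (date(2024, 3, 25), 600.0),
    (date(2024, 6, 26), 900.0),
]

if __name__ == "__main__":
    print(monthly_median(demo))
```

A series like this describes how asking prices move over time; it says nothing about what customers actually paid, which is the limitation the referee raises.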
Circularity Check
No circularity in direct data-release paper
This paper assembles and releases the BuyTheBy dataset of scraped paper-mill advertisements and prices with no derivations, predictions, fitted models, or first-principles claims. The central contribution is data collection and annotation from public text sources, followed only by elementary descriptive analysis to illustrate utility. No load-bearing step reduces to self-definition, fitted inputs renamed as predictions, or self-citation chains; the work is self-contained as a data resource.
Figure 4: An example advertisement from B1 (posted on 26 June 2024, id tag “message3806”).
Figure 5: An example advertisement from B2 (posted on 19 March 2024, id tag “message23”).
Figure 6: An example advertisement from B3 (poste...