MajinBook: An open catalogue of digitally mediated world literature
Pith reviewed 2026-05-17 21:53 UTC · model grok-4.3
The pith
Linking shadow library metadata with Goodreads creates a high-precision corpus of over 539,000 digitally mediated English books.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, the authors create a high-precision corpus of over 539,000 references to digitally mediated English-language books. Spanning three centuries and reflecting a contemporary selection bias, these entries are enriched with first publication dates, genres, and popularity metrics like ratings and reviews. The methodology prioritises natively digital EPUB files to ensure machine-readable quality while addressing biases in traditional corpora.
What carries the argument
The linkage strategy between shadow-library metadata and Goodreads records that produces high-precision matches for the corpus of digitally mediated books.
If this is right
- The corpus supports computational social science and cultural analytics on a large scale using digital book data.
- Researchers gain access to machine-readable EPUB files that avoid artifacts from scanned texts.
- Analysis of selection biases and popularity trends becomes possible across three centuries of English-language literature.
- Secondary datasets enable parallel studies for French, German, and Spanish books.
- Open data release allows other teams to replicate or extend the catalogue for their own text-mining projects.
Where Pith is reading between the lines
- The catalogue could be used to track how digital availability influences which older books remain widely read today.
- Similar linkage techniques might be applied to other cultural datasets such as film or music archives to create comparable research resources.
- Periodic updates to the corpus could incorporate newly added titles from the source libraries and reflect evolving popularity signals.
Load-bearing premise
The linkage strategy between shadow-library metadata and Goodreads records produces high-precision matches, and the resulting corpus can be used legally for text and data mining under EU and US research frameworks.
What would settle it
A manual audit of a random sample of linked entries that finds a high rate of incorrect matches or a legal ruling that prohibits research use of data derived from shadow libraries.
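A manual audit of this kind yields a precision estimate with sampling uncertainty. The sketch below shows one way to attach a Wilson score interval to that estimate. The sample size and the number of correct links are hypothetical values chosen for illustration, not figures from the paper.

```python
import math


def wilson_interval(correct: int, sample_size: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p_hat = correct / sample_size
    denom = 1 + z**2 / sample_size
    centre = (p_hat + z**2 / (2 * sample_size)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / sample_size + z**2 / (4 * sample_size**2)
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Hypothetical audit: 400 randomly sampled links, 388 judged correct.
audited, correct = 400, 388
low, high = wilson_interval(correct, audited)
print(f"precision estimate {correct / audited:.3f}, 95% interval [{low:.3f}, {high:.3f}]")
```

If the lower bound of the interval falls below whatever precision the corpus is claimed to have, the audit would count against the claim.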
Original abstract
This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries, such as Library Genesis and Z-Library, for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to digitally mediated English-language books. Spanning three centuries and reflecting a contemporary selection bias, these entries are enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritises natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MajinBook, an open catalogue linking crowd-sourced metadata from shadow libraries (Library Genesis and Z-Library) with Goodreads bibliographic records to produce a corpus of over 539,000 references to digitally mediated English-language books. The entries span three centuries, incorporate first publication dates, genres, and popularity metrics, prioritize EPUB files for machine readability, and are accompanied by secondary datasets for French, German, and Spanish. The work claims to evaluate linkage accuracy, releases all data openly, and addresses legal permissibility for text and data mining under EU and US research frameworks.
Significance. If the linkage produces demonstrably high-precision matches, MajinBook would constitute a useful resource for computational social science and cultural analytics by supplying a large-scale, contemporary, machine-readable alternative to corpora such as HathiTrust that mitigates certain selection biases. The open release of the full underlying data and the provision of multilingual secondary datasets are concrete strengths that support reuse and extension.
major comments (2)
- [Methods] Methods / Linkage Evaluation: The central claim of a 'high-precision corpus' of 539,000 references rests on the linkage between noisy crowd-sourced sources, yet the manuscript supplies no description of the matching algorithm (ISBN, fuzzy title/author, or hybrid), no precision/recall/F1 figures, and no account of the gold-standard validation set or protocol used to assess accuracy.
- [Results] Results / Corpus Size: The headline quantitative result (over 539,000 references) is presented without accompanying error-rate estimates or false-positive analysis; because both input archives contain variant titles, missing fields, and crowd-sourced noise, the absence of these metrics leaves the reliability of the reported corpus size and composition unverified.
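The metrics requested in the two major comments can be computed from a labelled validation set. The sketch below uses hypothetical counts of true positives, false positives, and false negatives, since the manuscript reports none. It treats 539,000 as a round lower bound for the corpus size. The share of wrong links among accepted links is taken as 1 minus precision (the false discovery rate), which is what matters for the reliability of the corpus.

```python
def linkage_metrics(tp: int, fp: int, fn: int) -> dict[str, float]:
    """Precision, recall, and F1 from validation counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical validation outcome: 470 correct links, 10 wrong links,
# 20 true pairs that the linkage failed to find.
metrics = linkage_metrics(tp=470, fp=10, fn=20)

# Scale the wrong-link share up to the reported corpus size.
corpus_size = 539_000  # "over 539,000" in the abstract, used as a lower bound
expected_false_links = round(corpus_size * (1 - metrics["precision"]))

print(metrics)
print(f"expected wrong links in corpus: about {expected_false_links}")
```

Reporting the counts alongside the ratios lets readers recompute the figures and judge how large the validation set was.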
minor comments (2)
- [Legal Discussion] The legal-permissibility discussion would benefit from explicit citations to relevant EU DSM Directive articles or US fair-use precedents rather than general statements of framework compatibility.
- [Figures and Tables] Figure captions and table headings should explicitly state the exact matching criteria and any filtering thresholds applied during corpus construction.
Simulated Author's Rebuttal
We thank the referee for their thorough review and constructive feedback on the linkage methodology and quantitative reliability of the MajinBook corpus. We address each major comment below and will revise the manuscript to incorporate the requested details and metrics.
- Referee: [Methods] Methods / Linkage Evaluation: The central claim of a 'high-precision corpus' of 539,000 references rests on the linkage between noisy crowd-sourced sources, yet the manuscript supplies no description of the matching algorithm (ISBN, fuzzy title/author, or hybrid), no precision/recall/F1 figures, and no account of the gold-standard validation set or protocol used to assess accuracy.
  Authors: We acknowledge that the current manuscript does not provide a detailed description of the matching algorithm, performance metrics, or validation protocol, even though it states that the linkage strategy was evaluated for accuracy. This is a substantive gap that weakens the claim of a high-precision corpus. We will add a new subsection to the Methods section that fully specifies the linkage procedure (a hybrid approach using ISBN matching where available, supplemented by fuzzy title/author matching with defined similarity thresholds), the construction and size of the gold-standard validation set, the annotation protocol, and the resulting precision, recall, and F1 scores. We will also include an error analysis of the main failure modes observed during validation.
  Revision: yes
- Referee: [Results] Results / Corpus Size: The headline quantitative result (over 539,000 references) is presented without accompanying error-rate estimates or false-positive analysis; because both input archives contain variant titles, missing fields, and crowd-sourced noise, the absence of these metrics leaves the reliability of the reported corpus size and composition unverified.
  Authors: We agree that the reported corpus size of 539,000+ references must be accompanied by quantitative error-rate estimates and false-positive analysis to allow readers to assess reliability, particularly given the known noise in the source metadata. Although the manuscript claims an evaluation of linkage accuracy, the Results section currently lacks these supporting figures. We will revise the Results section to report the estimated false-positive rate (and overall error rate) derived from the validation set, along with a brief discussion of how these rates affect the final corpus size and genre/popularity distributions. This addition will directly address the concern about unverified reliability.
  Revision: yes
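The hybrid procedure named in the first response (ISBN matching where available, then fuzzy title/author matching with similarity thresholds) could look roughly like the sketch below. This is not the authors' implementation, which the manuscript does not describe. The record fields, the threshold values, the use of `difflib.SequenceMatcher` as the similarity measure, and the sample records with placeholder ISBNs are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from typing import Optional


def normalise(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", text.lower())).strip()


def clean_isbn(isbn: Optional[str]) -> Optional[str]:
    """Keep digits and X only; return None when nothing usable remains."""
    if not isbn:
        return None
    return re.sub(r"[^0-9Xx]", "", isbn).upper() or None


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def link(shadow: dict, candidates: list[dict],
         title_min: float = 0.9, author_min: float = 0.85) -> Optional[tuple[dict, str]]:
    """Return (matched candidate, method), or None when no candidate qualifies."""
    # Step 1: exact match on a cleaned ISBN, when the shadow record has one.
    isbn = clean_isbn(shadow.get("isbn"))
    if isbn:
        for cand in candidates:
            if clean_isbn(cand.get("isbn")) == isbn:
                return cand, "isbn"
    # Step 2: fuzzy fallback; both title and author must clear their thresholds.
    best, best_score = None, 0.0
    for cand in candidates:
        t = similarity(shadow["title"], cand["title"])
        a = similarity(shadow["author"], cand["author"])
        if t >= title_min and a >= author_min and t + a > best_score:
            best, best_score = cand, t + a
    return (best, "fuzzy") if best is not None else None


# Hypothetical records; the ISBN strings are placeholders, not real identifiers.
goodreads = [
    {"id": "gr-1", "title": "Pride and Prejudice", "author": "Jane Austen", "isbn": "0000000000017"},
    {"id": "gr-2", "title": "Frankenstein; or, The Modern Prometheus", "author": "Mary Shelley", "isbn": None},
]
by_isbn = link({"title": "pride & prejudice", "author": "Austen, Jane", "isbn": "000-0-00-000001-7"}, goodreads)
by_fuzzy = link({"title": "Frankenstein, or the Modern Prometheus", "author": "Mary Shelley", "isbn": None}, goodreads)
no_match = link({"title": "Moby Dick", "author": "Herman Melville", "isbn": None}, goodreads)
print(by_isbn, by_fuzzy, no_match)
```

Requiring both thresholds in the fallback step favours precision over recall, which fits a corpus described as high-precision. The thresholds themselves would need tuning against the gold-standard validation set.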
Circularity Check
No circularity: data-construction paper with no derivations or fitted quantities
This paper introduces and releases a new catalogue by linking existing crowd-sourced metadata sources (shadow libraries and Goodreads). The central output is the corpus itself rather than any derived quantity, prediction, or first-principles result. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided text or abstract. The linkage claim is presented as a methodological choice whose accuracy is asserted to have been evaluated, but this evaluation is external to any internal derivation chain and does not reduce the output to the inputs by construction. The paper is therefore self-contained as a data release.