MajinBook: An open catalogue of digitally mediated world literature

Antoine Mazi\`eres; Thierry Poibeau

arxiv: 2511.11412 · v5 · submitted 2025-11-14 · 💻 cs.CL · cs.CY· stat.OT

MajinBook: An open catalogue of digitally mediated world literature

Antoine Mazi\`eres , Thierry Poibeau This is my paper

Pith reviewed 2026-05-17 21:53 UTC · model grok-4.3

classification 💻 cs.CL cs.CYstat.OT

keywords shadow librariesdigital bookscatalogueGoodreadstext miningbibliographic datacomputational social sciencedigital humanities

0 comments

The pith

Linking shadow library metadata with Goodreads creates a high-precision corpus of over 539,000 digitally mediated English books.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MajinBook as an open catalogue that connects metadata from shadow libraries such as Library Genesis and Z-Library with structured bibliographic data from Goodreads. This linkage produces a corpus of more than 539,000 references to English-language books in digital formats, spanning three centuries and including first publication dates, genres, and popularity metrics like ratings and reviews. The work prioritises natively digital EPUB files for machine readability and to reduce biases present in scanned corpora such as HathiTrust. It also supplies secondary datasets for French, German, and Spanish books, evaluates the accuracy of the matching process, and addresses the legal permissibility of using the data for research under existing frameworks.

Core claim

By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, the authors create a high-precision corpus of over 539,000 references to digitally mediated English-language books. Spanning three centuries and reflecting a contemporary selection bias, these entries are enriched with first publication dates, genres, and popularity metrics like ratings and reviews. The methodology prioritises natively digital EPUB files to ensure machine-readable quality while addressing biases in traditional corpora.

What carries the argument

The linkage strategy between shadow-library metadata and Goodreads records that produces high-precision matches for the corpus of digitally mediated books.

If this is right

The corpus supports computational social science and cultural analytics on a large scale using digital book data.
Researchers gain access to machine-readable EPUB files that avoid artifacts from scanned texts.
Analysis of selection biases and popularity trends becomes possible across three centuries of English-language literature.
Secondary datasets enable parallel studies for French, German, and Spanish books.
Open data release allows other teams to replicate or extend the catalogue for their own text-mining projects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The catalogue could be used to track how digital availability influences which older books remain widely read today.
Similar linkage techniques might be applied to other cultural datasets such as film or music archives to create comparable research resources.
Periodic updates to the corpus could incorporate newly added titles from the source libraries and reflect evolving popularity signals.

Load-bearing premise

The linkage strategy between shadow-library metadata and Goodreads records produces high-precision matches, and the resulting corpus can be used legally for text and data mining under EU and US research frameworks.

What would settle it

A manual audit of a random sample of linked entries that finds a high rate of incorrect matches or a legal ruling that prohibits research use of data derived from shadow libraries.

Figures

Figures reproduced from arXiv: 2511.11412 by Antoine Mazi\`eres, Thierry Poibeau.

**Figure 1.** Figure 1: Temporal distributions and biases of key corpora. The figure illustrates the distinct temporal biases of the key corpora, justifying our methodological focus on natively digital content. All three plots are semi-logarithmic (log y-axis), displaying item counts binned by publication decade. (a) Compares the EPUB and PDF subsets of shadow libraries. (b) Contrasts the scanned HathiTrust corpus with all Goodre… view at source ↗

**Figure 2.** Figure 2: The crawl of Goodreads: Item acquisition and recommendation decay. The figure illustrates the efficiency of our crawl methodology. The bars show the cumulative counts of Editions, Works, and Authors (left axis, in millions) gathered at each stage. The line plot tracks the number of new Recommendations (right axis, in thousands) discovered at each depth. The plot reveals a power-law distribution: the initia… view at source ↗

**Figure 3.** Figure 3: Precision-recall trade-off for book matching based on the title score threshold. The plot shows the point estimates for precision (dashed line) and recall (dotted line), along with their 95% confidence intervals (shaded areas), derived from bootstrap resampling of 143 human evaluations. The solid black line indicates the percentage of the dataset retained at each threshold. A vertical line marks our chose… view at source ↗

**Figure 4.** Figure 4: Temporal distribution of primary (English) v. secondary datasets. these two criteria—significant volume and a promising title score distribution—we selected the three of the largest non-English corpora for release: namely French (47,960 items), German (35,559), and Spanish (30,169). We must, however, stress that the precise quality of these matches remains unverified. We release these secondary catalogue… view at source ↗

read the original abstract

This data paper introduces MajinBook, an open catalogue designed to facilitate the use of shadow libraries-such as Library Genesis and Z-Library-for computational social science and cultural analytics. By linking metadata from these vast, crowd-sourced archives with structured bibliographic data from Goodreads, we create a high-precision corpus of over 539,000 references to digitally mediated English-language books. Spanning three centuries and reflecting a contemporary selection bias, these entries are enriched with first publication dates, genres, and popularity metrics like ratings and reviews. Our methodology prioritises natively digital EPUB files to ensure machine-readable quality, while addressing biases in traditional corpora like HathiTrust, and includes secondary datasets for French, German, and Spanish. We evaluate the linkage strategy for accuracy, release all underlying data openly, and discuss the project's legal permissibility under EU and US frameworks for text and data mining in research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MajinBook releases an open catalogue linking shadow-library metadata to Goodreads for over 539k English books plus smaller sets in other languages, but the linkage precision rests on an evaluation whose details are not shown in the abstract.

read the letter

The main takeaway here is that MajinBook is an open catalogue linking metadata from shadow libraries like Library Genesis and Z-Library with Goodreads data to produce a large set of references to digitally mediated books. The English portion has over 539,000 entries spanning three centuries but with a contemporary bias, and there are secondary datasets for French, German, and Spanish. The paper does a solid job on a few fronts. It emphasizes natively digital EPUB files to keep the quality high for machine processing. It directly addresses how corpora like HathiTrust tend to overrepresent older material and underrepresent recent digitally published works. Releasing all the data openly lets others inspect and use it, and the section on legal permissibility under EU and US text and data mining rules is practical for anyone planning to work with it. The linkage between the sources is the central technical step, and the abstract says they evaluated it for accuracy. However, without the specific algorithm, gold standard, or precision figures in the summary, it's difficult to assess how well it handles the inevitable noise in crowd-sourced metadata. Variant titles, incomplete author fields, and duplicates are common in both archives, so the high-precision claim needs the full methods to hold up. The stress-test concern about this is fair based on what's visible. This work is aimed at computational social scientists and cultural analytics researchers who need large-scale, open book metadata that includes popularity and genre information. A reader building or studying contemporary literature corpora would find the resource directly applicable. The paper shows honest engagement with the practical issues of data construction and legal use, which makes it worth sending to a serious referee. I would recommend peer review rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MajinBook, an open catalogue linking crowd-sourced metadata from shadow libraries (Library Genesis and Z-Library) with Goodreads bibliographic records to produce a corpus of over 539,000 references to digitally mediated English-language books. The entries span three centuries, incorporate first publication dates, genres, and popularity metrics, prioritize EPUB files for machine readability, and are accompanied by secondary datasets for French, German, and Spanish. The work claims to evaluate linkage accuracy, releases all data openly, and addresses legal permissibility for text and data mining under EU and US research frameworks.

Significance. If the linkage produces demonstrably high-precision matches, MajinBook would constitute a useful resource for computational social science and cultural analytics by supplying a large-scale, contemporary, machine-readable alternative to corpora such as HathiTrust that mitigates certain selection biases. The open release of the full underlying data and the provision of multilingual secondary datasets are concrete strengths that support reuse and extension.

major comments (2)

[Methods] Methods / Linkage Evaluation: The central claim of a 'high-precision corpus' of 539,000 references rests on the linkage between noisy crowd-sourced sources, yet the manuscript supplies no description of the matching algorithm (ISBN, fuzzy title/author, or hybrid), no precision/recall/F1 figures, and no account of the gold-standard validation set or protocol used to assess accuracy.
[Results] Results / Corpus Size: The headline quantitative result (over 539,000 references) is presented without accompanying error-rate estimates or false-positive analysis; because both input archives contain variant titles, missing fields, and crowd-sourced noise, the absence of these metrics leaves the reliability of the reported corpus size and composition unverified.

minor comments (2)

[Legal Discussion] The legal-permissibility discussion would benefit from explicit citations to relevant EU DSM Directive articles or US fair-use precedents rather than general statements of framework compatibility.
[Figures and Tables] Figure captions and table headings should explicitly state the exact matching criteria and any filtering thresholds applied during corpus construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback on the linkage methodology and quantitative reliability of the MajinBook corpus. We address each major comment below and will revise the manuscript to incorporate the requested details and metrics.

read point-by-point responses

Referee: [Methods] Methods / Linkage Evaluation: The central claim of a 'high-precision corpus' of 539,000 references rests on the linkage between noisy crowd-sourced sources, yet the manuscript supplies no description of the matching algorithm (ISBN, fuzzy title/author, or hybrid), no precision/recall/F1 figures, and no account of the gold-standard validation set or protocol used to assess accuracy.

Authors: We acknowledge that the current manuscript does not provide a detailed description of the matching algorithm, performance metrics, or validation protocol, even though it states that the linkage strategy was evaluated for accuracy. This is a substantive gap that weakens the claim of a high-precision corpus. We will add a new subsection to the Methods section that fully specifies the linkage procedure (a hybrid approach using ISBN matching where available, supplemented by fuzzy title/author matching with defined similarity thresholds), the construction and size of the gold-standard validation set, the annotation protocol, and the resulting precision, recall, and F1 scores. We will also include an error analysis of the main failure modes observed during validation. revision: yes
Referee: [Results] Results / Corpus Size: The headline quantitative result (over 539,000 references) is presented without accompanying error-rate estimates or false-positive analysis; because both input archives contain variant titles, missing fields, and crowd-sourced noise, the absence of these metrics leaves the reliability of the reported corpus size and composition unverified.

Authors: We agree that the reported corpus size of 539,000+ references must be accompanied by quantitative error-rate estimates and false-positive analysis to allow readers to assess reliability, particularly given the known noise in the source metadata. Although the manuscript claims an evaluation of linkage accuracy, the Results section currently lacks these supporting figures. We will revise the Results section to report the estimated false-positive rate (and overall error rate) derived from the validation set, along with a brief discussion of how these rates affect the final corpus size and genre/popularity distributions. This addition will directly address the concern about unverified reliability. revision: yes

Circularity Check

0 steps flagged

No circularity: data-construction paper with no derivations or fitted quantities

full rationale

This paper introduces and releases a new catalogue by linking existing crowd-sourced metadata sources (shadow libraries and Goodreads). The central output is the corpus itself rather than any derived quantity, prediction, or first-principles result. No equations, fitted parameters, self-definitional steps, or load-bearing self-citations appear in the provided text or abstract. The linkage claim is presented as a methodological choice whose accuracy is asserted to have been evaluated, but this evaluation is external to any internal derivation chain and does not reduce the output to the inputs by construction. The paper is therefore self-contained as a data release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data paper describing corpus construction; it contains no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5450 in / 1017 out tokens · 22574 ms · 2026-05-17T21:53:31.879735+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 1 internal anchor

[1]

The history of corpus linguistics,

T . McEnery and A. Hardie, “The history of corpus linguistics,” in The Oxford handbook of the history of linguistics(K. Allan, ed.), Ox- ford University Press, 2013. 8

work page 2013
[2]

Quanti- tative analysis of culture using millions of digitized books,

J.-B. Michel, Y. K. Shen, A. P . Aiden, A. Veres, M. K. Gray, G. B. Team, J. P . Pickett, D. Hoiberg, D. Clancy, P . Norvig,et al., “Quanti- tative analysis of culture using millions of digitized books,”science, vol. 331, no. 6014, pp. 176–182, 2011

work page 2011
[3]

Hathitrust. a research library at web scale,

H. Christenson, “Hathitrust. a research library at web scale,”Li- brary Resources & Technical Services, vol. 55, no. 2, pp. 93–102, 2011

work page 2011
[4]

Multi- level computational methods for interdisciplinary research in the hathitrust digital library,

J. Murdock, C. Allen, K. Börner, R. Light, S. McAlister, A. Raven- scroft, R. Rose, D. Rose, J. Otsuka, D. Bourget,et al., “Multi- level computational methods for interdisciplinary research in the hathitrust digital library,”PloS one, vol. 12, no. 9, p. e0184188, 2017

work page 2017
[5]

Underwood,Distant horizons: digital evidence and literary change

T . Underwood,Distant horizons: digital evidence and literary change. University of Chicago Press, 2019

work page 2019
[6]

The transformation of gender in english-language fiction,

T . Underwood, D. Bamman, and S. Lee, “The transformation of gender in english-language fiction,”Journal of Cultural Analytics, vol. 3, no. 2, 2018

work page 2018
[7]

The hathitrust digital library’ s potential for musicology research,

J. S. Downie, S. Bhattacharyya, F . Giannetti, E. D. Koehl, and P . Or- ganisciak, “The hathitrust digital library’ s potential for musicology research,”International Journal on Digital Libraries, vol. 21, no. 4, pp. 343–358, 2020

work page 2020
[8]

The dark history of hathitrust,

A. Centivany, “The dark history of hathitrust,” inProceedings of the 50th Hawaii International Conference on System Sciences, p. 1, 2017

work page 2017
[9]

Karaganis,Shadow libraries: Access to knowledge in global higher education

J. Karaganis,Shadow libraries: Access to knowledge in global higher education. The MIT Press, 2018

work page 2018
[10]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

H. Lu, W . Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T . Ren, Z. Li, H. Yang,et al., “Deepseek-vl: towards real-world vision-language understanding,”arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Richard Kadrey, et al. v. Meta Platforms, Inc., 2025. Case No. 23-cv- 03417-VC (ND Cal)

work page 2025
[12]

An- thropic PBC, 2025

Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson v. An- thropic PBC, 2025. No. C 24-05417 WHA (ND Cal)

work page 2025
[13]

Library genesis

“Library genesis.”https://libgen.rs/, 2008. Accessed: 2025- 06-06

work page 2008
[14]

Anna’ s archive

“ Anna’ s archive.”https://annas-archive.org/, 2022. Ac- cessed: 2025-06-06

work page 2022
[15]

Goodreads

O. Chandler and E. Khuri, “Goodreads.”https://www. goodreads.com/, 2006. Accessed: 2025-06-06

work page 2006
[16]

The goodreads “classics

M. Walsh and M. Antoniak, “The goodreads “classics”: a computa- tional study of readers, amazon, and crowdsourced amateur criti- cism,”Journal of Cultural Analytics, vol. 6, no. 2, pp. 243–287, 2021

work page 2021
[17]

The afterlives of shakespeare and company in online social readership,

M. Antoniak, D. Mimno, R. Thalken, M. Walsh, M. Wilkens, and G. Yauney, “The afterlives of shakespeare and company in online social readership,”arXiv preprint arXiv:2401.07340, 2024

work page arXiv 2024
[18]

The social lives of books: Reading vic- torian literature on goodreads,

K. Bourrier and M. Thelwall, “The social lives of books: Reading vic- torian literature on goodreads,”Journal of Cultural Analytics, vol. 5, no. 1, 2020

work page 2020
[19]

Goodreads reviews to as- sess the wider impacts of books,

K. Kousha, M. Thelwall, and M. Abdoli, “Goodreads reviews to as- sess the wider impacts of books,”Journal of the Association for In- formation Science and Technology, vol. 68, no. 8, pp. 2004–2016, 2017

work page 2004
[20]

Who decides what is read on goodreads? uncovering sponsorship and its implications for scholarly research,

Y. Hu, J. Diesner, T . Underwood, Z. LeBlanc, G. Layne-Worthey, and J. S. Downie, “Who decides what is read on goodreads? uncovering sponsorship and its implications for scholarly research,”Big Data & Society, vol. 12, no. 3, p. 20539517251359229, 2025

work page 2025
[21]

Functional require- ments for bibliographic records

I. F . of Library Associations and Institutions, “Functional require- ments for bibliographic records.”https://repository.ifla. org/handle/20.500.14598/830, 1998

work page 1998
[22]

The slaughterhouse of literature,

F . Moretti, “The slaughterhouse of literature,”MLQ: Modern Lan- guage Quarterly, vol. 61, no. 1, pp. 207–227, 2000

work page 2000
[23]

Openlibrary

A. Swartz, B. Kahle, A. Rossi, A. Chitipothu, and R. Hargrave Mala- mud, “Openlibrary.”https://openlibrary.org/, 2006. Ac- cessed: 2025-06-06

work page 2006
[24]

Item recommendation on mono- tonic behavior chains,

M. Wan and J. J. McAuley, “Item recommendation on mono- tonic behavior chains,” inProceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver , BC, Canada, October 2-7, 2018(S. Pera, M. D. Ekstrand, X. Amatriain, and J. O’Donovan, eds.), pp. 86–94, ACM, 2018

work page 2018
[25]

Fine-grained spoiler detection from large-scale review corpora,

M. Wan, R. Misra, N. Nakashole, and J. J. McAuley, “Fine-grained spoiler detection from large-scale review corpora,” inProceedings of the 57th Conference of the Association for Computational Lin- guistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers(A. Korhonen, D. R. Traum, and L. Màrquez, eds.), pp. 2605–2610, Association f...

work page 2019
[26]

Goodreads book datasets with user rating 2m

“Goodreads book datasets with user rating 2m.”https: //www.kaggle.com/datasets/bahramjannesarr/ goodreads-book-datasets-10m, 2020. Accessed: 2025-07- 10

work page 2020
[27]

Goodreads books

“Goodreads books.”https://huggingface.co/datasets/ BrightData/Goodreads-Books, 2024. Accessed: 2025-07-10

work page 2024
[28]

The globalization of copyright exceptions for ai training,

M. Sag and P . K. Yu, “The globalization of copyright exceptions for ai training,”Emory LJ, vol. 74, p. 1163, 2024

work page 2024
[29]

Feist Publications, Inc. v. Rural Telephone Service Co

“Feist Publications, Inc. v. Rural Telephone Service Co..” 499 U.S. 340, 1991. United States Supreme Court

work page 1991
[30]

Directive 96/9/EC of the Euro- pean Parliament and of the Council of 11 March 1996 on the legal protection of databases

European Parliament and Council, “Directive 96/9/EC of the Euro- pean Parliament and of the Council of 11 March 1996 on the legal protection of databases.” OJ L 77, p. 20, 1996

work page 1996
[31]

on copyright and related rights in the digital single market and amending directives 96/9/ec and 2001/29/ec,

Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019, “on copyright and related rights in the digital single market and amending directives 96/9/ec and 2001/29/ec,” 2019. Document 32019L0790

work page 2019
[32]

Copyright act of 1976,

“Copyright act of 1976,” 1976. 17 U.S.C. § 107 (Fair Use)

work page 1976
[33]

37 C.F .R. § 201.40 — “exemptions to prohibition against circum- vention

“37 C.F .R. § 201.40 — “exemptions to prohibition against circum- vention” .” Electronic Code of Federal Regulations, 2023. Accessed: 2025-10-20. 9

work page 2023

[1] [1]

The history of corpus linguistics,

T . McEnery and A. Hardie, “The history of corpus linguistics,” in The Oxford handbook of the history of linguistics(K. Allan, ed.), Ox- ford University Press, 2013. 8

work page 2013

[2] [2]

Quanti- tative analysis of culture using millions of digitized books,

J.-B. Michel, Y. K. Shen, A. P . Aiden, A. Veres, M. K. Gray, G. B. Team, J. P . Pickett, D. Hoiberg, D. Clancy, P . Norvig,et al., “Quanti- tative analysis of culture using millions of digitized books,”science, vol. 331, no. 6014, pp. 176–182, 2011

work page 2011

[3] [3]

Hathitrust. a research library at web scale,

H. Christenson, “Hathitrust. a research library at web scale,”Li- brary Resources & Technical Services, vol. 55, no. 2, pp. 93–102, 2011

work page 2011

[4] [4]

Multi- level computational methods for interdisciplinary research in the hathitrust digital library,

J. Murdock, C. Allen, K. Börner, R. Light, S. McAlister, A. Raven- scroft, R. Rose, D. Rose, J. Otsuka, D. Bourget,et al., “Multi- level computational methods for interdisciplinary research in the hathitrust digital library,”PloS one, vol. 12, no. 9, p. e0184188, 2017

work page 2017

[5] [5]

Underwood,Distant horizons: digital evidence and literary change

T . Underwood,Distant horizons: digital evidence and literary change. University of Chicago Press, 2019

work page 2019

[6] [6]

The transformation of gender in english-language fiction,

T . Underwood, D. Bamman, and S. Lee, “The transformation of gender in english-language fiction,”Journal of Cultural Analytics, vol. 3, no. 2, 2018

work page 2018

[7] [7]

The hathitrust digital library’ s potential for musicology research,

J. S. Downie, S. Bhattacharyya, F . Giannetti, E. D. Koehl, and P . Or- ganisciak, “The hathitrust digital library’ s potential for musicology research,”International Journal on Digital Libraries, vol. 21, no. 4, pp. 343–358, 2020

work page 2020

[8] [8]

The dark history of hathitrust,

A. Centivany, “The dark history of hathitrust,” inProceedings of the 50th Hawaii International Conference on System Sciences, p. 1, 2017

work page 2017

[9] [9]

Karaganis,Shadow libraries: Access to knowledge in global higher education

J. Karaganis,Shadow libraries: Access to knowledge in global higher education. The MIT Press, 2018

work page 2018

[10] [10]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

H. Lu, W . Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T . Ren, Z. Li, H. Yang,et al., “Deepseek-vl: towards real-world vision-language understanding,”arXiv preprint arXiv:2403.05525, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Richard Kadrey, et al. v. Meta Platforms, Inc., 2025. Case No. 23-cv- 03417-VC (ND Cal)

work page 2025

[12] [12]

An- thropic PBC, 2025

Andrea Bartz, Charles Graeber, and Kirk Wallace Johnson v. An- thropic PBC, 2025. No. C 24-05417 WHA (ND Cal)

work page 2025

[13] [13]

Library genesis

“Library genesis.”https://libgen.rs/, 2008. Accessed: 2025- 06-06

work page 2008

[14] [14]

Anna’ s archive

“ Anna’ s archive.”https://annas-archive.org/, 2022. Ac- cessed: 2025-06-06

work page 2022

[15] [15]

Goodreads

O. Chandler and E. Khuri, “Goodreads.”https://www. goodreads.com/, 2006. Accessed: 2025-06-06

work page 2006

[16] [16]

The goodreads “classics

M. Walsh and M. Antoniak, “The goodreads “classics”: a computa- tional study of readers, amazon, and crowdsourced amateur criti- cism,”Journal of Cultural Analytics, vol. 6, no. 2, pp. 243–287, 2021

work page 2021

[17] [17]

The afterlives of shakespeare and company in online social readership,

M. Antoniak, D. Mimno, R. Thalken, M. Walsh, M. Wilkens, and G. Yauney, “The afterlives of shakespeare and company in online social readership,”arXiv preprint arXiv:2401.07340, 2024

work page arXiv 2024

[18] [18]

The social lives of books: Reading vic- torian literature on goodreads,

K. Bourrier and M. Thelwall, “The social lives of books: Reading vic- torian literature on goodreads,”Journal of Cultural Analytics, vol. 5, no. 1, 2020

work page 2020

[19] [19]

Goodreads reviews to as- sess the wider impacts of books,

K. Kousha, M. Thelwall, and M. Abdoli, “Goodreads reviews to as- sess the wider impacts of books,”Journal of the Association for In- formation Science and Technology, vol. 68, no. 8, pp. 2004–2016, 2017

work page 2004

[20] [20]

Who decides what is read on goodreads? uncovering sponsorship and its implications for scholarly research,

Y. Hu, J. Diesner, T . Underwood, Z. LeBlanc, G. Layne-Worthey, and J. S. Downie, “Who decides what is read on goodreads? uncovering sponsorship and its implications for scholarly research,”Big Data & Society, vol. 12, no. 3, p. 20539517251359229, 2025

work page 2025

[21] [21]

Functional require- ments for bibliographic records

I. F . of Library Associations and Institutions, “Functional require- ments for bibliographic records.”https://repository.ifla. org/handle/20.500.14598/830, 1998

work page 1998

[22] [22]

The slaughterhouse of literature,

F . Moretti, “The slaughterhouse of literature,”MLQ: Modern Lan- guage Quarterly, vol. 61, no. 1, pp. 207–227, 2000

work page 2000

[23] [23]

Openlibrary

A. Swartz, B. Kahle, A. Rossi, A. Chitipothu, and R. Hargrave Mala- mud, “Openlibrary.”https://openlibrary.org/, 2006. Ac- cessed: 2025-06-06

work page 2006

[24] [24]

Item recommendation on mono- tonic behavior chains,

M. Wan and J. J. McAuley, “Item recommendation on mono- tonic behavior chains,” inProceedings of the 12th ACM Conference on Recommender Systems, RecSys 2018, Vancouver , BC, Canada, October 2-7, 2018(S. Pera, M. D. Ekstrand, X. Amatriain, and J. O’Donovan, eds.), pp. 86–94, ACM, 2018

work page 2018

[25] [25]

Fine-grained spoiler detection from large-scale review corpora,

M. Wan, R. Misra, N. Nakashole, and J. J. McAuley, “Fine-grained spoiler detection from large-scale review corpora,” inProceedings of the 57th Conference of the Association for Computational Lin- guistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers(A. Korhonen, D. R. Traum, and L. Màrquez, eds.), pp. 2605–2610, Association f...

work page 2019

[26] [26]

Goodreads book datasets with user rating 2m

“Goodreads book datasets with user rating 2m.”https: //www.kaggle.com/datasets/bahramjannesarr/ goodreads-book-datasets-10m, 2020. Accessed: 2025-07- 10

work page 2020

[27] [27]

Goodreads books

“Goodreads books.”https://huggingface.co/datasets/ BrightData/Goodreads-Books, 2024. Accessed: 2025-07-10

work page 2024

[28] [28]

The globalization of copyright exceptions for ai training,

M. Sag and P . K. Yu, “The globalization of copyright exceptions for ai training,”Emory LJ, vol. 74, p. 1163, 2024

work page 2024

[29] [29]

Feist Publications, Inc. v. Rural Telephone Service Co

“Feist Publications, Inc. v. Rural Telephone Service Co..” 499 U.S. 340, 1991. United States Supreme Court

work page 1991

[30] [30]

Directive 96/9/EC of the Euro- pean Parliament and of the Council of 11 March 1996 on the legal protection of databases

European Parliament and Council, “Directive 96/9/EC of the Euro- pean Parliament and of the Council of 11 March 1996 on the legal protection of databases.” OJ L 77, p. 20, 1996

work page 1996

[31] [31]

on copyright and related rights in the digital single market and amending directives 96/9/ec and 2001/29/ec,

Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019, “on copyright and related rights in the digital single market and amending directives 96/9/ec and 2001/29/ec,” 2019. Document 32019L0790

work page 2019

[32] [32]

Copyright act of 1976,

“Copyright act of 1976,” 1976. 17 U.S.C. § 107 (Fair Use)

work page 1976

[33] [33]

37 C.F .R. § 201.40 — “exemptions to prohibition against circum- vention

“37 C.F .R. § 201.40 — “exemptions to prohibition against circum- vention” .” Electronic Code of Federal Regulations, 2023. Accessed: 2025-10-20. 9

work page 2023