Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research
Pith reviewed 2026-05-15 21:39 UTC · model grok-4.3
The pith
A new 30-billion-word corpus of Italian discussion board messages supports native LLM pre-training and studies of online language use.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present Testimole-conversational, a massive collection of discussion board messages in Italian totaling more than 30 billion word-tokens from 1996 to 2024. This corpus serves as an ideal dataset for pre-training native Italian large language models. Discussion board messages also provide a relevant resource for linguistic and sociological analysis by capturing computer-mediated communication, informal written Italian, discourse dynamics, and online social interaction over a wide time span.
What carries the argument
The Testimole-conversational corpus, a compiled set of Italian discussion board messages that aggregates informal digital texts across decades for language model training and sociolinguistic study.
If this is right
- It enables pre-training of large language models using authentic native Italian data rather than translated material.
- It supports tracking of language variation and change in informal written Italian over nearly thirty years.
- It provides material for analysis of discourse patterns and social interactions in online discussion environments.
- It facilitates domain adaptation and conversational analysis tasks in natural language processing.
- The long time span permits studies of how digital communication has evolved from 1996 to 2024.
Where Pith is reading between the lines
- Similar large-scale corpora could be assembled for other languages that currently lack sufficient native training data.
- Models trained on this data may capture informal registers and generational language features better than those relying on formal texts.
- Cross-comparison with English forum corpora could highlight language-specific patterns in online social behavior.
- Future extensions might link messages to shared media to study multimodal aspects of digital conversation.
Load-bearing premise
The collected and processed messages form a clean, representative, and high-quality dataset suitable for model training without major biases or errors introduced during collection.
What would settle it
A direct comparison showing that models trained on this corpus perform no better than those trained on smaller, manually curated Italian datasets, or evidence of widespread unfiltered spam, duplicates, or non-Italian content.
Original abstract
We present "Testimole-conversational" a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models'pre-training. Furthermore, discussion boards' messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also support investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents 'Testimole-conversational', a corpus of more than 30 billion word tokens drawn from Italian discussion-board messages spanning 1996–2024. It positions the resource as suitable for pre-training Italian LLMs and for sociolinguistic analysis of computer-mediated communication, informal written Italian, and online social dynamics.
Significance. A large-scale, temporally extended Italian conversational corpus would address a clear gap in resources for low-resource language modeling and digital sociolinguistics. If the data are shown to be sufficiently clean and representative, the release could support improved domain adaptation, conversational modeling, and longitudinal studies of language variation.
major comments (2)
- [Abstract] Abstract: The central claim that the corpus 'renders it an ideal dataset for native Italian Large Language Models' pre-training' is unsupported because the manuscript provides no description of collection methods, language identification, length or perplexity filtering, MinHash/exact deduplication, or any post-processing steps that would remove spam, boilerplate, code, or non-Italian text.
- [Corpus construction] Corpus construction section (or equivalent): No quantitative metrics are reported on the fraction of tokens retained after cleaning, the rate of near-duplicates, or the distribution of message lengths and perplexity scores. These statistics are required to convert the headline 30 B token figure into a usable training-token count.
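The exact and MinHash deduplication named in the first major comment can be illustrated with a minimal sketch. It is not the authors' pipeline, which the manuscript does not describe. The function names, the 3-word shingle size, the 64 hash seeds, and the 0.8 similarity threshold are all assumptions chosen for illustration, and the pairwise comparison is quadratic, so a real corpus would need locality-sensitive hashing instead.

```python
import hashlib
import re


def shingles(text, k=3):
    """Return the set of k-word shingles of a lowercased text (hypothetical helper)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(shingle_set, num_perm=64):
    """One salted 64-bit hash per seed; keep the minimum over all shingles."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in shingle_set
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def deduplicate(messages, threshold=0.8, k=3):
    """Drop exact duplicates by SHA-256, then near-duplicates by MinHash."""
    seen_exact = set()
    kept, kept_signatures = [], []
    for message in messages:
        normalized = " ".join(message.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest in seen_exact:
            continue  # exact duplicate after case and whitespace normalization
        seen_exact.add(digest)
        shingle_set = shingles(normalized, k)
        if not shingle_set:
            continue  # nothing but punctuation or whitespace
        signature = minhash_signature(shingle_set)
        if any(estimated_jaccard(signature, s) >= threshold for s in kept_signatures):
            continue  # near-duplicate of a message already kept
        kept.append(message)
        kept_signatures.append(signature)
    return kept
```

Messages that differ only in case, spacing, or punctuation collapse to a single kept copy, while messages with no shared shingles are both retained.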
minor comments (2)
- [Abstract] Abstract: 'Models'pre-training' is missing a space; 'it also support investigations' should read 'it also supports investigations'.
- [Abstract] The abstract states the corpus 'will be made freely available' but supplies no license, access URL, or citation format; these details should appear in a dedicated 'Availability' paragraph.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the manuscript to incorporate the requested details on corpus construction and supporting metrics.
Point-by-point responses
- Referee: [Abstract] Abstract: The central claim that the corpus 'renders it an ideal dataset for native Italian Large Language Models' pre-training' is unsupported because the manuscript provides no description of collection methods, language identification, length or perplexity filtering, MinHash/exact deduplication, or any post-processing steps that would remove spam, boilerplate, code, or non-Italian text.
  Authors: We agree that the abstract claim would be strengthened by explicit reference to the underlying methods. In the revised manuscript we will shorten the abstract claim slightly and add a brief clause noting the use of language identification, length/perplexity filtering, and deduplication. We will also insert a new dedicated 'Corpus Construction' section that fully describes data sourcing, language identification, filtering criteria, MinHash and exact deduplication, and post-processing steps for spam, boilerplate, code, and non-Italian content.
  revision: yes
- Referee: [Corpus construction] Corpus construction section (or equivalent): No quantitative metrics are reported on the fraction of tokens retained after cleaning, the rate of near-duplicates, or the distribution of message lengths and perplexity scores. These statistics are required to convert the headline 30 B token figure into a usable training-token count.
  Authors: We acknowledge that these quantitative metrics are necessary for readers to assess the effective training volume. We will add a subsection within the new 'Corpus Construction' section that reports (i) token retention rates after each cleaning stage, (ii) the fraction of near-duplicates removed via MinHash and exact matching, and (iii) summary statistics and distributions for message lengths and perplexity scores. These additions will allow direct conversion of the raw 30 B token count into a usable training-token estimate.
  revision: yes
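The per-stage retention figures promised in the second response could be computed along the following lines. This is a sketch under stated assumptions: tokens are approximated by whitespace splitting, the stage names and filter rules in the usage note are hypothetical, and perplexity statistics are left out because they require a trained language model.

```python
from statistics import mean, median


def count_tokens(messages):
    """Whitespace token count (an assumption; the paper's tokenization is not specified)."""
    return sum(len(m.split()) for m in messages)


def retention_report(messages, stages):
    """Apply each (name, keep_fn) stage in order and record the tokens retained.

    Returns (report, cleaned), where report is a list of
    (stage_name, tokens_remaining, fraction_of_raw_tokens).
    """
    raw_tokens = count_tokens(messages)
    report = [("raw", raw_tokens, 1.0)]
    current = list(messages)
    for name, keep in stages:
        current = [m for m in current if keep(m)]
        tokens = count_tokens(current)
        fraction = tokens / raw_tokens if raw_tokens else 0.0
        report.append((name, tokens, fraction))
    return report, current


def length_summary(messages):
    """Summary statistics of message lengths in whitespace tokens."""
    lengths = [len(m.split()) for m in messages]
    return {
        "n": len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }
```

For example, calling `retention_report(messages, [("min_length", lambda m: len(m.split()) >= 3), ("no_links", lambda m: "http" not in m)])` returns one row per stage, and the last row's fraction multiplied by the raw token count gives the usable training-token estimate the referee asks for.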
Circularity Check
No significant circularity in corpus resource announcement
This is a data resource paper announcing the collection and release of a 30B-token Italian discussion-board corpus. It contains no equations, derivations, predictions, fitted parameters, or uniqueness theorems. The central claim is descriptive (the corpus exists and has the stated size and time span) and does not reduce to any self-referential input by construction. No self-citations are load-bearing for any derivation, and no ansatz or renaming of known results occurs. The paper is self-contained as a corpus description; absence of quality-filtering details is a completeness issue, not circularity.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction
  unclear: Relation between the paper passage and the cited Recognition theorem.
We present 'Testimole-conversational' a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training.
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Introduction. Over the past three decades, a new form of written communication has emerged due to the diffusion of digital communication networks among the general public. This constituted a revolutionary event in the history of written language, as the digital medium began to be massively used as a form of communication meant for ordinary, sponta- n...
- [2] Related Work. The potentiality of gathering large text corpora from discussion boards was already explored more than thirty years ago. (Lund and Burgess, 1996; Burgess and Livesay, 1998), notably, compiled the HAL Corpus by collecting 131 million words from Usenet over the course of February 1995. The HAL Corpus was used to train a model encoding semantic and g...
- [3] The TestiMole-Conversational Resource. 3.1. Discussion Boards. The text sources of the TestiMole-Conversational corpus are two types of discussion boards. Discussion boards are platforms where users can exchange messages on specific topics. Messages, called “posts”, are organized in “threads” that refer to a very specific topic of discussion, usually identi...
- [4] The TestiMole Dataset. TestiMole-Conversational is part of a larger dataset originally created in order to provide the academic community with better resources to train different kind of language models employing high-quality native Italian resources, also for long-context training. In this work, we decided to focus on the "conversational" subset as a scien...
- [5] Conclusion. In an historical period characterized by the rise of Large Language Models and the consequent quest for large and clean datasets, TestiMole stands out as an important resource for improving the capabilities of natively Italian as well as multilingual LMs to correctly model peculiar elements of Italian language and society, drawing from the ri...
- [6]
- [7] Limitations. Given the substantial manual work involved in designing appropriate collection strategies from diverse platforms, it was not possible to include every Italian discussion board in the corpus; neither it is possible to quantify the proportion of the collected resource over the total. Further sources could have probably been retrieved, but we believ...
- [8] Ethical considerations. From an ethical standpoint, the collection of online conversational data raises concerns regarding user privacy and consent, even when such content was publicly accessible at the time of collection. To mitigate these risks, we anonymized all usernames from the corpus. We assume that users followed [Figure 4: Top 50 newsgroups by total c...]
- [9] Appendix. The script used to web-scrape the data is composed of four main functions:
  • The main function, given an URL pattern defined by a prefix, a range, and a suffix, calls the get_url function for all forum topics.
  • The get_url function connects to the server and attempts to retrieve the page. In particular, it uses Python’s requests library to down- ...
- [10] Bibliographical References. Giuseppe Antonelli. 2016. L’e-taliano tra storia e leggende. In L’e-taliano. Scriventi e scritture nell’era digitale, pages 11–28. Franco Cesati Editore. Emanuele Ferdinando Barbera. 2013. Una introduzione ai NUNC: storia della creazione di un corpus, volume Molti occhi sono meglio di uno: saggi di linguistica generale 2008-12...
- [11] Qu. A. S. A. R. Manuel Barbera and Carla Marello. 2011. Tra scritto-parlato, Umgangssprache e comunicazione in rete: i corpora NUNC. In «Studi di Grammatica Italiana» XXVII (2008, recte 2011) = Per Giovanni Nencioni. Convegno Internazionale di Studi. Pisa - Firenze, 4-5 Maggio 2009, pages 157–185. Le Lettere. Marco Baroni, Silvia Bernardini, Adriano Ferrares...
- [12] Long-term social media data collection at the university of turin. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 41–46, Turin, Italy. CEUR Workshop Proceedings. Vladimír Benko. 2014. Aranea: Yet another family of (comparable) web corpora. In International Conference on Text, Speech, and Dialogue, pages 247–2...
- [13] Large-scale databases of proper names. Behavior research methods, instruments, & computers: a journal of the Psychonomic Society, Inc, 31:215–9. Helmut Feldweg, Ralf Kibinger, and Christine Thielen. 1995. Zum sprachgebrauch in deutschen newsgruppen. In Neue Medien. Osnabrücker Beiträge zur Sprachtheorie, pages 143–154. Oldenburg: Red. OBST. Paolo Ga...
- [14] AlBERTo: Modeling Italian Social Media Language with BERT. IJCoL [Online]. Roland Schäfer, Felix Bildhauer, et al. 2012. Building large corpora from the web using a new efficient tool chain. In Lrec, pages 486–493. Jasmin Schröck and Harald Lüngen. 2015. Building and annotating a corpus of german-language newsgroups. In NLP4CMC 2015. 2nd Workshop on Na...
- [15] Language Resource References. Barbera, Manuel and Colombo, Simona and Marello, Carla. 2011. NUNC - A Multilanguage Suite of Newsgroups Corpora. Università degli Studi di Torino. PID http://www.bmanuel.org/projects/ng-HOME.html. Shaoul, Cyrus and Westbury, Chris. 2013. A reduced redundancy USENET corpus (2005-2011). Edmonton, AB: University of Alberta. PI...
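The appendix excerpt in entry [9] describes a main function that expands a URL pattern (prefix, range, suffix) and a get_url function that tries to retrieve each page. A minimal sketch of that control flow follows, with several assumptions stated plainly: the excerpt says the authors use Python's requests library, whereas this sketch substitutes the standard-library urllib so that it has no external dependency; build_urls, the fetch parameter, the timeout, and the example URL are hypothetical; and the other two functions mentioned in the appendix are not visible in the excerpt, so they are not reconstructed here.

```python
import urllib.request


def build_urls(prefix, id_range, suffix):
    """Expand a prefix/range/suffix pattern into concrete topic URLs."""
    return [f"{prefix}{i}{suffix}" for i in id_range]


def get_url(url, timeout=10.0):
    """Try to download one page; return its text, or None on failure.

    urllib is an assumption swapped in for the requests library named in the paper.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and timeouts; ValueError covers malformed URLs.
        return None


def main(prefix, id_range, suffix, fetch=get_url):
    """Call the fetcher for every topic URL and keep the pages that were retrieved."""
    pages = {}
    for url in build_urls(prefix, id_range, suffix):
        html = fetch(url)
        if html is not None:
            pages[url] = html
    return pages
```

Passing the fetcher as a parameter is a design choice made for this sketch so that the loop can be exercised without network access, for instance `main("https://forum.example/topic-", range(1, 4), ".html", fetch=my_fake_fetch)`.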