Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Matteo Rinaldi; Rossella Varvara; Viviana Patti

arxiv: 2602.14819 · v2 · submitted 2026-02-16 · 💻 cs.CL

Pith reviewed 2026-05-15 21:39 UTC · model grok-4.3

classification 💻 cs.CL

keywords Italian corpus · discussion boards · language modeling · sociolinguistic research · large language models · computer-mediated communication · pre-training data · Italian language

The pith

A new 30-billion-word corpus of Italian discussion board messages supports native LLM pre-training and studies of online language use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Testimole-conversational, a collection of more than 30 billion words from Italian discussion boards spanning 1996 to 2024. This dataset is positioned as a high-quality resource for pre-training large language models in Italian, addressing the scarcity of native-language training data. It also enables examination of informal written Italian, discourse dynamics, and social interactions across nearly three decades of computer-mediated communication. The corpus is intended for applications in language modeling, domain adaptation, and conversational analysis while supporting investigations of language variation and digital social phenomena. The authors will release the resource freely to the research community.

Core claim

We present Testimole-conversational, a massive collection of discussion board messages in Italian totaling more than 30 billion word-tokens from 1996 to 2024. This corpus serves as an ideal dataset for pre-training native Italian large language models. Discussion board messages also provide a relevant resource for linguistic and sociological analysis by capturing computer-mediated communication, informal written Italian, discourse dynamics, and online social interaction over a wide time span.

What carries the argument

The Testimole-conversational corpus, a compiled set of Italian discussion board messages that aggregates informal digital texts across decades for language model training and sociolinguistic study.

If this is right

It enables pre-training of large language models using authentic native Italian data rather than translated material.
It supports tracking of language variation and change in informal written Italian over nearly thirty years.
It provides material for analysis of discourse patterns and social interactions in online discussion environments.
It facilitates domain adaptation and conversational analysis tasks in natural language processing.
The long time span permits studies of how digital communication has evolved from 1996 to 2024.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar large-scale corpora could be assembled for other languages that currently lack sufficient native training data.
Models trained on this data may capture informal registers and generational language features better than those relying on formal texts.
Cross-comparison with English forum corpora could highlight language-specific patterns in online social behavior.
Future extensions might link messages to shared media to study multimodal aspects of digital conversation.

Load-bearing premise

The collected and processed messages form a clean, representative, and high-quality dataset suitable for model training without major biases or errors introduced during collection.

What would settle it

A direct comparison showing that models trained on this corpus perform no better than those trained on smaller, manually curated Italian datasets, or evidence of widespread unfiltered spam, duplicates, or non-Italian content.

Figures

Figures reproduced from arXiv: 2602.14819 by Matteo Rinaldi, Rossella Varvara, Viviana Patti.

**Figure 2.** Usenet - Number of tokens per year. Axes: Year (x), Tokens (y).

**Figure 3.** Forums - Number of tokens per year. Accompanying text from the paper: … this type of discussion board has probably lost popularity, with other platforms (such as social networks) becoming the digital places for discussion. The distribution of tokens across years differs between the two subgroups of data: Usenet data peak in 2003, while for forums data the highest number of tokens is for the year …

**Figure 4.** Top 50 newsgroups by total character count (all periods combined).

**Figure 5.** Top 50 forums by total character count (all periods combined).

**Figure 6.** Normalized frequencies of six words across time in the …

Original abstract

We present "Testimole-conversational" a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models'pre-training. Furthermore, discussion boards' messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also support investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This announces a large Italian discussion corpus but skips the evidence on data quality and cleaning.

The one thing to know is that this paper announces a 30-billion-word Italian corpus drawn from discussion boards over nearly 30 years, but it gives no evidence on how the data was gathered or cleaned.

What is new is the scale for Italian conversational text. Most available Italian data is smaller or more formal, so a resource this size focused on informal online messages could help with building better native models and with research on language in digital contexts. The paper does well at spelling out the applications: pre-training LLMs, domain adaptation, conversational analysis, and sociolinguistic studies of variation and social phenomena. The time span from 1996 to 2024 is a strength because it allows looking at changes over the history of the web in Italy.

The soft spots are in the missing details. The abstract states the size and intended uses but provides no information on collection methods, filtering, deduplication, or validation. Forum data usually requires substantial work to remove spam, boilerplate, non-Italian content, and near-duplicates. Without any quantitative description of those steps, the suitability for pre-training is not supported by evidence. The weakest assumption is that the collected messages form a clean and representative dataset. There are no mathematical derivations or circular claims here; it is a resource paper.

This paper is for computational linguists working on Italian or low-resource languages and for researchers in digital sociolinguistics. A reader who needs large-scale conversational Italian text would get value from it if the data and processing details are released.

I recommend sending it for peer review. The idea targets a genuine gap, and with a methods section added it could be a solid contribution. As it is, the evidence is too thin, but the work deserves referee time to see the full picture.

Referee Report

2 major / 2 minor

Summary. The manuscript presents 'Testimole-conversational', a corpus of more than 30 billion word tokens drawn from Italian discussion-board messages spanning 1996–2024. It positions the resource as suitable for pre-training Italian LLMs and for sociolinguistic analysis of computer-mediated communication, informal written Italian, and online social dynamics.

Significance. A large-scale, temporally extended Italian conversational corpus would address a clear gap in resources for low-resource language modeling and digital sociolinguistics. If the data are shown to be sufficiently clean and representative, the release could support improved domain adaptation, conversational modeling, and longitudinal studies of language variation.

major comments (2)

[Abstract] The central claim that the corpus 'renders it an ideal dataset for native Italian Large Language Models' pre-training' is unsupported because the manuscript provides no description of collection methods, language identification, length or perplexity filtering, MinHash/exact deduplication, or any post-processing steps that would remove spam, boilerplate, code, or non-Italian text.
[Corpus construction] No quantitative metrics are reported on the fraction of tokens retained after cleaning, the rate of near-duplicates, or the distribution of message lengths and perplexity scores. These statistics are required to convert the headline 30 B token figure into a usable training-token count.
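
To make the near-duplicate concern concrete, here is a minimal sketch of MinHash similarity estimation using only the Python standard library. It is an illustration of the general technique the referee names, not the authors' pipeline; the function names, the shingle size `k`, and the number of hash seeds `num_perm` are all assumptions chosen for readability.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles of a whitespace-tokenised message."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, shingle):
    """64-bit hash of a shingle; each seed acts as a different hash function."""
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(text, num_perm=64, k=3):
    """Signature = minimum hash value over all shingles, once per seed."""
    sh = shingles(text, k)
    if not sh:
        return None  # empty messages have no signature
    return [min(_seeded_hash(seed, s) for s in sh) for seed in range(num_perm)]


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

In practice, pairs whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates, and at corpus scale the signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.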

minor comments (2)

[Abstract] Abstract: 'Models'pre-training' is missing a space; 'it also support investigations' should read 'it also supports investigations'.
[Abstract] The abstract states the corpus 'will be made freely available' but supplies no license, access URL, or citation format; these details should appear in a dedicated 'Availability' paragraph.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the manuscript to incorporate the requested details on corpus construction and supporting metrics.

Referee: [Abstract] The central claim that the corpus 'renders it an ideal dataset for native Italian Large Language Models' pre-training' is unsupported because the manuscript provides no description of collection methods, language identification, length or perplexity filtering, MinHash/exact deduplication, or any post-processing steps that would remove spam, boilerplate, code, or non-Italian text.

Authors: We agree that the abstract claim would be strengthened by explicit reference to the underlying methods. In the revised manuscript we will shorten the abstract claim slightly and add a brief clause noting the use of language identification, length/perplexity filtering, and deduplication. We will also insert a new dedicated 'Corpus Construction' section that fully describes data sourcing, language identification, filtering criteria, MinHash and exact deduplication, and post-processing steps for spam, boilerplate, code, and non-Italian content.

revision: yes

Referee: [Corpus construction] No quantitative metrics are reported on the fraction of tokens retained after cleaning, the rate of near-duplicates, or the distribution of message lengths and perplexity scores. These statistics are required to convert the headline 30 B token figure into a usable training-token count.

Authors: We acknowledge that these quantitative metrics are necessary for readers to assess the effective training volume. We will add a subsection within the new 'Corpus Construction' section that reports (i) token retention rates after each cleaning stage, (ii) the fraction of near-duplicates removed via MinHash and exact matching, and (iii) summary statistics and distributions for message lengths and perplexity scores. These additions will allow direct conversion of the raw 30 B token count into a usable training-token estimate.

revision: yes
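
The stage-by-stage retention accounting described in this exchange could look roughly like the sketch below. It is a hypothetical illustration, not the authors' code: tokens are approximated by whitespace splitting, and the two stages (`drop_short`, `drop_exact_duplicates`) and the `min_tokens` threshold are assumed stand-ins for whatever cleaning steps a real pipeline would use.

```python
import hashlib


def count_tokens(messages):
    """Approximate word-token count by whitespace splitting."""
    return sum(len(m.split()) for m in messages)


def drop_short(messages, min_tokens=3):
    """Length filter: keep messages with at least min_tokens tokens."""
    return [m for m in messages if len(m.split()) >= min_tokens]


def drop_exact_duplicates(messages):
    """Exact deduplication on case- and whitespace-normalised text."""
    seen = set()
    kept = []
    for m in messages:
        normalised = " ".join(m.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(m)
    return kept


def retention_report(messages, stages):
    """Apply stages in order; report (name, tokens, share of raw tokens)."""
    raw = count_tokens(messages)
    report = [("raw", raw, 1.0)]
    current = messages
    for name, stage in stages:
        current = stage(current)
        tokens = count_tokens(current)
        report.append((name, tokens, tokens / raw if raw else 0.0))
    return report, current
```

The last row of such a report is the usable training-token count; applied to the real corpus, it would show how much of the headline 30 B figure survives cleaning.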

Circularity Check

0 steps flagged

No significant circularity in corpus resource announcement

This is a data resource paper announcing the collection and release of a 30B-token Italian discussion-board corpus. It contains no equations, derivations, predictions, fitted parameters, or uniqueness theorems. The central claim is descriptive (the corpus exists and has the stated size and time span) and does not reduce to any self-referential input by construction. No self-citations are load-bearing for any derivation, and no ansatz or renaming of known results occurs. The paper is self-contained as a corpus description; absence of quality-filtering details is a completeness issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data resource paper with no mathematical model, derivations, or theoretical claims. No free parameters, axioms, or invented entities are present.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: We present 'Testimole-conversational' a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models' pre-training.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 1 internal anchor

[1]

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Introduction. Over the past three decades, a new form of written communication has emerged due to the diffusion of digital communication networks among the general public. This constituted a revolutionary event in the history of written language, as the digital medium began to be massively used as a form of communication meant for ordinary, sponta- n...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Usenet as a text corpus

Related Work. The potentiality of gathering large text corpora from discussion boards was already explored more than thirty years ago. (Lund and Burgess, 1996; Burgess and Livesay, 1998), notably, compiled the HAL Corpus by collecting 131 million words from Usenet over the course of February 1995. The HAL Corpus was used to train a model encoding semantic and g...

work page 1996
[3]

posts”, are organized in “threads

The TestiMole-Conversational Resource. 3.1. Discussion Boards. The text sources of the TestiMole-Conversational corpus are two types of discussion boards. Discussion boards are platforms where users can exchange messages on specific topics. Messages, called “posts”, are organized in “threads” that refer to a very specific topic of discussion, usually identi...

work page 1979
[4]

conversational

The TestiMole Dataset. TestiMole-Conversational is part of a larger dataset originally created in order to provide the academic community with better resources to train different kind of language models employing high-quality native Italian resources, also for long-context training. In this work, we decided to focus on the "conversational" subset as a scien...

work page
[5]

Conclusion. In an historical period characterized by the rise of Large Language Models and the consequent quest for large and clean datasets, TestiMole stands out as an important resource for improving the capabilities of natively Italian as well as multilingual LMs to correctly model peculiar elements of Italian language and society, drawing from the ri...

work page
[6]

HARMONIA

Acknowledgements. The work of V. Patti and M. Rinaldi have been partially supported by the “HARMONIA” project - M4-C2, I1.3 Partenariati Estesi - Cascade Call - FAIR - CUP C63C22000770006 - PE PE0000013 under the NextGenerationEU programme

work page
[7]

Further sources could have probably been retrieved, but we believe that the present corpus already represents a wide and representative sample of this variety of CMC language

Limitations. Given the substantial manual work involved in designing appropriate collection strategies from diverse platforms, it was not possible to include every Italian discussion board in the corpus; neither it is possible to quantify the proportion of the collected resource over the total. Further sources could have probably been retrieved, but we believ...

work page
[8]

To mitigate these risks, we anonymized all usernames from the corpus

Ethical considerations. From an ethical standpoint, the collection of online conversational data raises concerns regarding user privacy and consent, even when such content was publicly accessible at the time of collection. To mitigate these risks, we anonymized all usernames from the corpus. We assume that users followed Figure 4: Top 50 newsgroups by total c...

work page 2022
[9]

• The get_url function connects to the server and attempts to retrieve the page

Appendix. The script used to web-scrape the data is composed of four main functions: • The main function, given an URL pattern defined by a prefix, a range, and a suffix, calls the get_url function for all forum topics. • The get_url function connects to the server and attempts to retrieve the page. In particular, it uses Python’s requests library to down- ...

work page
[10]

Bibliographical References. Giuseppe Antonelli. 2016. L’e-taliano tra storia e leggende. In L’e-taliano. Scriventi e scritture nell’era digitale, pages 11–28. Franco Cesati Editore. Emanuele Ferdinando Barbera. 2013. Una introduzione ai NUNC: storia della creazione di un corpus, volume Molti occhi sono meglio di uno: saggi di linguistica generale 2008-12...

work page 2016
[11]

Qu. A. S. A. R. Manuel Barbera and Carla Marello. 2011. Tra scritto-parlato, Umgangssprache e comunicazione in rete: i corpora NUNC. In «Studi di Grammatica Italiana» XXVII (2008, recte 2011) = Per Giovanni Nencioni. Convegno Internazionale di Studi. Pisa - Firenze, 4-5 Maggio 2009, pages 157–185. Le Lettere. Marco Baroni, Silvia Bernardini, Adriano Ferrares...

work page 2011
[12]

In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 41–46, Turin, Italy

Long-term social media data collection at the university of turin. In Proceedings of the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), pages 41–46, Turin, Italy. CEUR Workshop Proceedings. Vladimír Benko. 2014. Aranea: Yet another family of (comparable) web corpora. In International Conference on Text, Speech, and Dialogue, pages 247–2...

work page 2018
[13]

Behavior research methods, instruments, & computers: a journal of the Psychonomic Society, Inc, 31:215–9

Large-scale databases of proper names. Behavior research methods, instruments, & computers: a journal of the Psychonomic Society, Inc, 31:215–9. Helmut Feldweg, Ralf Kibinger, and Christine Thielen. 1995. Zum sprachgebrauch in deutschen newsgruppen. In Neue Medien. Osnabrücker Beiträge zur Sprachtheorie, pages 143–154. Oldenburg: Red. OBST. Paolo Ga...

work page 1995
[14]

Roland Schäfer, Felix Bildhauer, et al

AlBERTo: Modeling Italian Social Media Language with BERT. IJCoL [Online]. Roland Schäfer, Felix Bildhauer, et al. 2012. Building large corpora from the web using a new efficient tool chain. In Lrec, pages 486–493. Jasmin Schröck and Harald Lüngen. 2015. Building and annotating a corpus of german-language newsgroups. In NLP4CMC 2015. 2nd Workshop on Na...

work page 2012
[15]

2011. NUNC - A Multilanguage Suite of Newsgroups Corpora

Language Resource References. Barbera, Manuel and Colombo, Simona and Marello, Carla. 2011. NUNC - A Multilanguage Suite of Newsgroups Corpora. Università degli Studi di Torino. PID http://www.bmanuel.org/projects/ng-HOME.html. Shaoul, Cyrus and Westbury, Chris. 2013. A reduced redundancy USENET corpus (2005-2011). Edmonton, AB: University of Alberta. PI...

work page 2011
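
The appendix excerpt in entry [9] names a `main` function that expands a URL pattern (prefix, range, suffix) and a `get_url` function that fetches each page with Python's requests library. The sketch below illustrates only those two steps and is an assumption-laden reconstruction, not the authors' script: the excerpt is truncated, so the remaining two functions are not shown, and the `fetch` parameter, the timeout, the status-code check, and the returned dictionary are hypothetical choices made so the example can run without network access.

```python
from typing import Callable, Dict, Iterable, Optional


def get_url(url: str, fetch: Optional[Callable] = None,
            timeout: float = 10.0) -> Optional[str]:
    """Try to download one page; return its HTML, or None on any failure."""
    if fetch is None:
        import requests  # third-party; named in the appendix excerpt
        fetch = requests.get
    try:
        response = fetch(url, timeout=timeout)
    except OSError:
        # requests' RequestException derives from IOError/OSError
        return None
    if response.status_code != 200:
        return None
    return response.text


def main(prefix: str, ids: Iterable[int], suffix: str,
         fetch: Optional[Callable] = None) -> Dict[str, str]:
    """Build prefix + id + suffix for every id and keep the pages that load."""
    pages: Dict[str, str] = {}
    for topic_id in ids:
        url = f"{prefix}{topic_id}{suffix}"
        html = get_url(url, fetch=fetch)
        if html is not None:
            pages[url] = html
    return pages
```

A call such as `main("https://forum.example/topic-", range(1, 1001), ".html")` would walk a hypothetical topic range; a real crawler would also need rate limiting and retry handling, which are left out here.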
