
arxiv: 2602.14819 · v2 · submitted 2026-02-16 · 💻 cs.CL

Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Pith reviewed 2026-05-15 21:39 UTC · model grok-4.3

classification 💻 cs.CL
keywords Italian corpus · discussion boards · language modeling · sociolinguistic research · large language models · computer-mediated communication · pre-training data · Italian language

The pith

A new 30-billion-word corpus of Italian discussion board messages supports native LLM pre-training and studies of online language use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Testimole-conversational, a collection of more than 30 billion words from Italian discussion boards spanning 1996 to 2024. This dataset is positioned as a high-quality resource for pre-training large language models in Italian, addressing the scarcity of native-language training data. It also enables examination of informal written Italian, discourse dynamics, and social interactions across nearly three decades of computer-mediated communication. The corpus is intended for applications in language modeling, domain adaptation, and conversational analysis while supporting investigations of language variation and digital social phenomena. The authors will release the resource freely to the research community.

Core claim

We present Testimole-conversational, a massive collection of discussion board messages in Italian totaling more than 30 billion word-tokens from 1996 to 2024. This corpus serves as an ideal dataset for pre-training native Italian large language models. Discussion board messages also provide a relevant resource for linguistic and sociological analysis by capturing computer-mediated communication, informal written Italian, discourse dynamics, and online social interaction over a wide time span.

What carries the argument

The Testimole-conversational corpus, a compiled set of Italian discussion board messages that aggregates informal digital texts across decades for language model training and sociolinguistic study.

If this is right

  • It enables pre-training of large language models using authentic native Italian data rather than translated material.
  • It supports tracking of language variation and change in informal written Italian over nearly thirty years.
  • It provides material for analysis of discourse patterns and social interactions in online discussion environments.
  • It facilitates domain adaptation and conversational analysis tasks in natural language processing.
  • The long time span permits studies of how digital communication has evolved from 1996 to 2024.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar large-scale corpora could be assembled for other languages that currently lack sufficient native training data.
  • Models trained on this data may capture informal registers and generational language features better than those relying on formal texts.
  • Cross-comparison with English forum corpora could highlight language-specific patterns in online social behavior.
  • Future extensions might link messages to shared media to study multimodal aspects of digital conversation.

Load-bearing premise

The collected and processed messages form a clean, representative, and high-quality dataset suitable for model training without major biases or errors introduced during collection.

What would settle it

A direct comparison showing that models trained on this corpus perform no better than those trained on smaller, manually curated Italian datasets, or evidence of widespread unfiltered spam, duplicates, or non-Italian content.

Figures

Figures reproduced from arXiv: 2602.14819 by Matteo Rinaldi, Rossella Varvara, Viviana Patti.

Figure 1: Total corpus size (in MB) per year.
Figure 2: Usenet - Number of tokens per year. (x-axis: Year, 1991 to 2024; y-axis: Tokens, 0 to 2.5E+09.)
Figure 3: Forums - Number of tokens per year.
    From the accompanying text: "… this type of discussion boards has probably lost success, with other platforms (such as social networks) becoming digital places for discussions. It is interesting to note that the distribution of tokens among years differ for the two subgroups of data: Usenet data reach the peak of tokens in 2003, while for forums data we have the highest number of tokens for the year …"
Figure 4: Top 50 newsgroups by total character count (all periods combined).
Figure 5: Top 50 forums by total character count (all periods combined).
Figure 6: Normalized frequencies of six words across time in the …
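
Figure 6 plots normalized word frequencies over time. This page does not state the normalization the paper uses, so the sketch below assumes one common convention: occurrences per million tokens within each year. The function name, the whitespace tokenizer, and the sample posts are illustrative assumptions, not taken from the paper.

```python
from collections import Counter, defaultdict


def normalized_frequencies(posts, targets, per=1_000_000):
    """Occurrences of each target word per `per` tokens, grouped by year.

    posts   : iterable of (year, text) pairs
    targets : words to track (matched case-insensitively)
    Assumption: whitespace tokenization, which ignores punctuation handling.
    """
    targets = {t.lower() for t in targets}
    totals = Counter()            # total token count per year
    hits = defaultdict(Counter)   # per-year counts of the target words
    for year, text in posts:
        tokens = text.lower().split()
        totals[year] += len(tokens)
        for token in tokens:
            if token in targets:
                hits[year][token] += 1
    return {
        year: {word: hits[year][word] * per / totals[year] for word in targets}
        for year in totals
        if totals[year]  # skip years with no tokens to avoid dividing by zero
    }


# Hypothetical sample data, for illustration only.
posts = [(2003, "ciao a tutti ciao"), (2015, "buongiorno a tutti")]
freq = normalized_frequencies(posts, ["ciao", "tutti"])
```

Normalizing by each year's token total matters here because the yearly volume varies widely (Figures 2 and 3), so raw counts would mostly track corpus size rather than word usage.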

Original abstract

We present "Testimole-conversational" a massive collection of discussion boards messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for native Italian Large Language Models'pre-training. Furthermore, discussion boards' messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction in wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also support investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents 'Testimole-conversational', a corpus of more than 30 billion word tokens drawn from Italian discussion-board messages spanning 1996–2024. It positions the resource as suitable for pre-training Italian LLMs and for sociolinguistic analysis of computer-mediated communication, informal written Italian, and online social dynamics.

Significance. A large-scale, temporally extended Italian conversational corpus would address a clear gap in resources for low-resource language modeling and digital sociolinguistics. If the data are shown to be sufficiently clean and representative, the release could support improved domain adaptation, conversational modeling, and longitudinal studies of language variation.

major comments (2)
  1. [Abstract] The claim that the corpus's size 'renders it an ideal dataset for native Italian Large Language Models' pre-training' is unsupported: the manuscript does not describe its collection methods, language identification, length or perplexity filtering, MinHash or exact deduplication, or any post-processing steps that would remove spam, boilerplate, code, or non-Italian text.
  2. [Corpus construction] In the corpus construction section (or its equivalent), no quantitative metrics are reported on the fraction of tokens retained after cleaning, the rate of near-duplicates, or the distribution of message lengths and perplexity scores. These statistics are required to convert the headline 30B-token figure into a usable training-token count.
minor comments (2)
  1. [Abstract] 'Models'pre-training' is missing a space; 'it also support investigations' should read 'it also supports investigations'.
  2. [Abstract] The abstract states the corpus 'will be made freely available' but supplies no license, access URL, or citation format; these details should appear in a dedicated 'Availability' paragraph.
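
The deduplication steps named in the first major comment can be sketched as follows. This is a minimal illustration of exact deduplication and MinHash similarity estimation using only the Python standard library; it is not the authors' pipeline, which this page does not describe. All function names, the 3-word shingle size, and the 64 hash functions are assumptions chosen for the example.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def exact_dedup(messages):
    """Keep the first occurrence of each message after normalization."""
    seen, kept = set(), []
    for message in messages:
        key = hashlib.sha1(normalize(message).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(message)
    return kept


def shingles(text, n=3):
    """Set of word n-grams; messages shorter than n words yield one shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text, num_hashes=64, n=3):
    """For each salted hash function, record the minimum hash over all shingles."""
    grams = [g.encode("utf-8") for g in shingles(text, n)]
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g, digest_size=8, salt=salt).digest(), "big"
            )
            for g in grams
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Share of matching positions, an estimate of shingle-set Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

A pair of messages whose estimated similarity exceeds a chosen threshold would be treated as near-duplicates; the threshold is a tuning decision the manuscript would need to report. At corpus scale, comparing all pairs is infeasible, so signatures are usually bucketed with locality-sensitive hashing first.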

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comments point by point below and will revise the manuscript to incorporate the requested details on corpus construction and supporting metrics.

  1. Referee: [Abstract] The claim that the corpus's size 'renders it an ideal dataset for native Italian Large Language Models' pre-training' is unsupported: the manuscript does not describe its collection methods, language identification, length or perplexity filtering, MinHash or exact deduplication, or any post-processing steps that would remove spam, boilerplate, code, or non-Italian text.

    Authors: We agree that the abstract claim would be strengthened by explicit reference to the underlying methods. In the revised manuscript we will shorten the abstract claim slightly and add a brief clause noting the use of language identification, length/perplexity filtering, and deduplication. We will also insert a new dedicated 'Corpus Construction' section that fully describes data sourcing, language identification, filtering criteria, MinHash and exact deduplication, and post-processing steps for spam, boilerplate, code, and non-Italian content. revision: yes

  2. Referee: [Corpus construction] In the corpus construction section (or its equivalent), no quantitative metrics are reported on the fraction of tokens retained after cleaning, the rate of near-duplicates, or the distribution of message lengths and perplexity scores. These statistics are required to convert the headline 30B-token figure into a usable training-token count.

    Authors: We acknowledge that these quantitative metrics are necessary for readers to assess the effective training volume. We will add a subsection within the new 'Corpus Construction' section that reports (i) token retention rates after each cleaning stage, (ii) the fraction of near-duplicates removed via MinHash and exact matching, and (iii) summary statistics and distributions for message lengths and perplexity scores. These additions will allow direct conversion of the raw 30 B token count into a usable training-token estimate. revision: yes
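
The per-stage retention rates promised in the second response could be tabulated along the following lines. This is a sketch under stated assumptions: the stage names, the minimum-length threshold, the whitespace tokenizer, and the sample messages are hypothetical, and nothing here reflects the authors' actual cleaning pipeline.

```python
def count_tokens(messages):
    """Whitespace token count; an assumption, since no tokenizer is specified."""
    return sum(len(m.split()) for m in messages)


def make_min_length_filter(min_tokens):
    """Keep messages with at least `min_tokens` tokens (hypothetical threshold)."""
    return lambda message: len(message.split()) >= min_tokens


def make_exact_dedup_filter():
    """Stateful filter that keeps only the first copy of each normalized message."""
    seen = set()

    def keep(message):
        key = " ".join(message.lower().split())
        if key in seen:
            return False
        seen.add(key)
        return True

    return keep


def retention_report(messages, stages):
    """Apply (name, keep_fn) stages in order and report tokens left after each.

    Returns a list of (stage_name, tokens_remaining, fraction_of_raw_tokens).
    """
    raw = count_tokens(messages)
    report = [("raw", raw, 1.0)]
    current = list(messages)
    for name, keep in stages:
        current = [m for m in current if keep(m)]
        tokens = count_tokens(current)
        report.append((name, tokens, tokens / raw if raw else 0.0))
    return report


# Hypothetical sample data, for illustration only.
messages = ["ciao", "questo è un messaggio lungo", "questo è un messaggio lungo"]
report = retention_report(messages, [
    ("min_length", make_min_length_filter(2)),
    ("exact_dedup", make_exact_dedup_filter()),
])
```

The last row of such a report is the usable training-token estimate the referee asks for: the headline raw count multiplied by the final retained fraction.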

Circularity Check

0 steps flagged

No significant circularity in corpus resource announcement


This is a data resource paper announcing the collection and release of a 30B-token Italian discussion-board corpus. It contains no equations, derivations, predictions, fitted parameters, or uniqueness theorems. The central claim is descriptive (the corpus exists and has the stated size and time span) and does not reduce to any self-referential input by construction. No self-citations are load-bearing for any derivation, and no ansatz or renaming of known results occurs. The paper is self-contained as a corpus description; absence of quality-filtering details is a completeness issue, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data resource paper with no mathematical model, derivations, or theoretical claims. No free parameters, axioms, or invented entities are present.

pith-pipeline@v0.9.0 · 5445 in / 1056 out tokens · 23243 ms · 2026-05-15T21:39:41.635472+00:00 · methodology


