The Schwurbelarchiv: a German Language Telegram dataset for the Study of Conspiracy Theories

Elisabeth Hoeldrich; Jana Lasser; Joao Pinheiro Neto; Mathias Angermaier

arxiv: 2504.06318 · v3 · submitted 2025-04-08 · 💻 cs.SI

Pith reviewed 2026-05-22 21:08 UTC · model grok-4.3

classification 💻 cs.SI

keywords Telegram dataset · conspiracy theories · German language · social media · misinformation · audio transcription · online discourse

The pith

A cleaned Telegram archive supplies 63 million predominantly German messages plus over 3 million transcriptions for conspiracy research.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes how a raw collection of German Telegram content, archived in a format that was hard to process, was parsed, cleaned, validated, pseudonymised, and transcribed to create a usable dataset. The resulting resource covers more than 5,800 groups and channels and 63 million messages, with transcriptions of over 3 million audio and video files that together represent roughly 126,000 hours of spoken material. Validation rests on linguistic and temporal markers that confirm the predominantly German origin; the abstract reports no separate check of the conspiracy-related focus. The authors argue this turns a data hoard into a structured collection suited for studying misinformation, extremism, opinion shifts, and network patterns in German-language online discourse.
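As a rough plausibility check on the scale figures above, the quoted totals can be turned into per-file and per-source averages. The sketch below uses only the rounded numbers from this summary. Because the file and source counts are lower bounds ("over", "more than"), the averages are approximate upper bounds. The constant and function names are illustrative and do not come from the paper.

```python
# Rounded figures as quoted in the summary; all are approximate.
TRANSCRIBED_FILES = 3_000_000  # lower bound: "over 3 million"
AUDIO_HOURS = 126_000          # "roughly 126,000 hours"
MESSAGES = 63_000_000          # "63 million messages"
SOURCES = 5_800                # lower bound: "more than 5,800"


def mean_minutes_per_file(hours: float, files: int) -> float:
    """Average duration of one transcribed file, in minutes."""
    return hours * 60 / files


def mean_messages_per_source(messages: int, sources: int) -> float:
    """Average number of messages per group or channel."""
    return messages / sources


avg_minutes = mean_minutes_per_file(AUDIO_HOURS, TRANSCRIBED_FILES)
avg_messages = mean_messages_per_source(MESSAGES, SOURCES)
continuous_years = AUDIO_HOURS / (24 * 365)  # playback time if run nonstop

print(f"about {avg_minutes:.2f} minutes of audio per transcribed file")
print(f"about {avg_messages:,.0f} messages per group or channel")
print(f"about {continuous_years:.1f} years of continuous playback")
```

An average of a few minutes per file would be consistent with a collection dominated by voice messages and short clips, although the averages alone say nothing about how durations are distributed.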

Core claim

By parsing, cleaning, validating, pseudonymising user data, and transcribing the raw Schwurbelarchiv archive, the authors produce a structured dataset of 63 million messages from over 5,800 Telegram groups and channels that is ready for systematic study of German conspiracy-theory discourse and supports text analysis of originally multimodal content.

What carries the argument

The processing pipeline that converts the raw Schwurbelarchiv archive, originally uploaded by an anonymous data hoarder, into a pseudonymised, validated collection with transcriptions of its audio and video files.

If this is right

Researchers can now examine text originally spoken in voice messages and videos rather than only typed posts.
The resource directly supports work on misinformation spread, political extremism, opinion adaptation, and social network structures within German-language Telegram communities.
The dataset fills a gap for systematic, large-scale analysis of one-to-many and interactive communication on Telegram that was previously unavailable in processed form.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The transcriptions open the possibility of comparing spoken and written styles within the same conspiracy communities.
Temporal markers in the data could support tracking how specific narratives evolve or spread across channels over time.
The scale and multimodal coverage may enable quantitative tests of how voice content influences engagement compared with text-only posts.

Load-bearing premise

The archived material comes predominantly from German-language conspiracy-theory discourse, as indicated only by language and time markers.

What would settle it

A sample showing that most messages are not in German or not focused on conspiracy theories would falsify the dataset description.
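One way to make this test concrete is to annotate a simple random sample of messages and put a confidence interval around the observed share. The sketch below uses the Wilson score interval for a binomial proportion. The sample counts are hypothetical placeholders, not results from the paper, and the function names are illustrative.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


def majority_verdict(successes: int, n: int, z: float = 1.96) -> str:
    """Say whether the sample supports or refutes a 'most messages' claim."""
    low, high = wilson_interval(successes, n, z)
    if low > 0.5:
        return "majority supported"
    if high < 0.5:
        return "majority refuted"
    return "inconclusive"


# Hypothetical annotation outcome: 412 of 500 sampled messages labelled
# as conspiracy-related. These numbers are invented for illustration only.
low, high = wilson_interval(412, 500)
print(f"share in [{low:.3f}, {high:.3f}] -> {majority_verdict(412, 500)}")
```

The same function applies to the language claim by counting messages labelled as German instead. The interval only covers sampling error, so annotation guidelines and inter-annotator agreement would still need to be reported separately.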

Original abstract

Sociality borne by language, as is the predominant digital trace on text-based social media platforms, harbours the raw material for exploring a multitude of social phenomena. Distinctively, the messaging service Telegram provides functionalities that allow for socially interactive as well as one-to-many communication. Our Telegram dataset contains over 5,800 groups and channels and 63 million messages, originating from a data-hoarding initiative named the "Schwurbelarchiv" (from German schwurbeln: speaking nonsense). Uniquely, it includes the transcriptions of over 3 million audio and video files. While the raw data was previously archived on the Internet Archive by an anonymous data hoarder, it was stored in a format that is difficult to process and largely inaccessible for systematic research. Our contribution consists of parsing, cleaning, and validating this raw archive, pseudonymising user data, and transcribing roughly 126,000 hours of audio and video content, thereby transforming this data hoard into a structured, research-ready dataset. This dataset publication details the structure, scope, and methodological specifics of the Schwurbelarchiv, emphasising its relevance for further research on the German-language conspiracy-theory-related discourse. We validate its predominantly German origin by linguistic and temporal markers and situate it within the context of similar datasets. We describe process and extent of the transcription of multimedia files. Thanks to this effort the dataset uniquely supports analysis of text from originally multimodal sources like voice messages and videos to investigate online social dynamics and content dissemination. Researchers can employ this resource to explore societal dynamics related to misinformation, political extremism, opinion adaptation, and social network structures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the construction of the Schwurbelarchiv dataset from a raw Telegram archive previously hosted on the Internet Archive. It contains data from over 5,800 groups and channels, 63 million messages, and transcriptions of over 3 million audio and video files (roughly 126,000 hours). The authors detail the parsing, cleaning, validation, and pseudonymization steps, along with the transcription process, and position the resulting structured dataset as a resource for research on German-language conspiracy-theory discourse, misinformation, political extremism, opinion adaptation, and social network structures. They validate the predominantly German origin via linguistic and temporal markers and compare the resource to existing datasets.

Significance. If the content composition and transcription quality hold, the dataset would offer a substantial contribution to computational social science by providing large-scale access to originally multimodal German-language Telegram data that was previously difficult to process. The scale (63M messages plus 3M transcripts) and focus on an under-represented language community enable studies of voice-message dissemination and conspiracy-related social dynamics that are not feasible with text-only English-centric corpora. The explicit description of the processing pipeline from raw archive to research-ready format supports reproducibility for similar data-hoarding collections.

major comments (2)

[Abstract] The manuscript claims relevance for 'research on the German-language conspiracy-theory-related discourse' and states that the data originate from the 'Schwurbelarchiv' (nonsense-speaking archive). However, the only validation reported is that the material is of 'predominantly German origin by linguistic and temporal markers.' No sampling results, keyword-based filtering statistics, manual review of message content, or automated classification metrics are provided to establish that the 63M messages and 3M transcripts predominantly concern conspiracy theories rather than other German-language topics. This assumption is load-bearing for the central positioning and intended research uses.
[Transcription process description] The paper reports transcribing roughly 126,000 hours of audio and video content to produce 3 million transcripts but supplies no quantitative evaluation of transcription accuracy, such as word error rate on a held-out test set, comparison against human transcripts, or error-rate estimates stratified by audio quality or speaker characteristics. Without such metrics, the reliability of the transcribed portion for downstream text-based analyses cannot be assessed.
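For readers unfamiliar with the metric named in the second comment, word error rate is conventionally defined as WER = (S + D + I) / N, where S, D, and I count the word substitutions, deletions, and insertions needed to turn the reference into the automatic transcript, and N is the number of reference words. The sketch below computes it with a word-level edit distance. It is not the authors' evaluation. Lowercasing and whitespace tokenisation are simplifying assumptions, and the example sentences are invented.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one of five reference words is dropped by the recogniser.
wer = word_error_rate("das ist ein kurzer test", "das ist ein test")
print(f"WER = {wer:.2f}")
```

Averaging this score over a manually transcribed sample, ideally stratified by audio quality, would give the kind of estimate the referee asks for. WER can exceed 1.0 when the hypothesis contains many inserted words.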

minor comments (2)

[Dataset release section] Specify the exact file formats, directory structure, and schema fields (e.g., how original message IDs map to pseudonymized user identifiers and how transcript files are linked to the corresponding message records).
[Related-work paragraph] Update the comparison to include any Telegram datasets released after 2023 that also incorporate multimedia or non-English content to strengthen the novelty claim.
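To illustrate the kind of schema detail the first minor comment asks for, the sketch below shows one possible arrangement: a keyed hash that maps user identifiers to stable pseudonyms, and a composite key that links transcripts to message records. The paper's actual scheme is not described in this review, so every field name, the key handling, and the choice of HMAC-SHA256 are assumptions made for illustration.

```python
import hashlib
import hmac
from dataclasses import dataclass, replace
from typing import Optional


def pseudonymise(user_id: str, secret_key: bytes) -> str:
    """Map a user ID to a stable pseudonym using a keyed hash (HMAC-SHA256).

    Without the secret key the mapping cannot be recomputed from public IDs.
    Truncating to 16 hex characters is an arbitrary illustrative choice.
    """
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


@dataclass(frozen=True)
class MessageRecord:
    # Hypothetical fields; the released dataset may be organised differently.
    chat_id: int
    message_id: int
    author_pseudonym: str
    text: str
    transcript: Optional[str] = None


def attach_transcript(
    message: MessageRecord, transcripts: dict[tuple[int, int], str]
) -> MessageRecord:
    """Link a transcript to its message via the (chat_id, message_id) key."""
    key = (message.chat_id, message.message_id)
    return replace(message, transcript=transcripts.get(key))


# Invented example values.
secret = b"example-key-kept-outside-the-release"
record = MessageRecord(
    chat_id=1,
    message_id=10,
    author_pseudonym=pseudonymise("12345", secret),
    text="",
)
linked = attach_transcript(record, {(1, 10): "hallo welt"})
print(linked)
```

A composite key is used here because Telegram message IDs are, as far as this sketch assumes, only unique within a chat. A documented schema would settle whether that assumption holds for the released files.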

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive review. We address the two major comments point by point below, agreeing where revisions are warranted while defending the manuscript on the basis of the data's provenance.

Referee: [Abstract] The manuscript claims relevance for 'research on the German-language conspiracy-theory-related discourse' and states that the data originate from the 'Schwurbelarchiv' (nonsense-speaking archive). However, the only validation reported is that the material is of 'predominantly German origin by linguistic and temporal markers.' No sampling results, keyword-based filtering statistics, manual review of message content, or automated classification metrics are provided to establish that the 63M messages and 3M transcripts predominantly concern conspiracy theories rather than other German-language topics. This assumption is load-bearing for the central positioning and intended research uses.

Authors: The Schwurbelarchiv collection was assembled by an anonymous data hoarder with the explicit aim of archiving German Telegram channels and groups focused on conspiracy theories and related 'schwurbeln' discourse, as reflected in the archive name and its documented purpose. Our paper processes and structures this pre-existing collection rather than performing independent content curation. The linguistic and temporal validation confirms the German-language scope matching the collection's intent. We will revise the manuscript to expand the description of the archive's origin and collection rationale in the introduction and methods, providing additional context on why the dataset is positioned for conspiracy-theory research. revision: yes
Referee: [Transcription process description] The paper reports transcribing roughly 126,000 hours of audio and video content to produce 3 million transcripts but supplies no quantitative evaluation of transcription accuracy, such as word error rate on a held-out test set, comparison against human transcripts, or error-rate estimates stratified by audio quality or speaker characteristics. Without such metrics, the reliability of the transcribed portion for downstream text-based analyses cannot be assessed.

Authors: We agree that explicit discussion of transcription quality would strengthen the paper. The 3 million transcripts were generated via automated speech recognition applied at scale to the multimedia files. Because no held-out test set was reserved during the original pipeline, new empirical metrics such as WER cannot be computed without substantial additional work. In revision we will describe the transcription model employed, cite published German-language accuracy benchmarks for that model, and add a limitations subsection addressing potential error sources and their relevance for downstream use. revision: partial

standing simulated objections not resolved

New quantitative transcription accuracy metrics (e.g., word error rate on a held-out test set or stratified error estimates) will not be provided, because no such evaluation was performed during dataset creation.

Circularity Check

0 steps flagged

No circularity: dataset construction paper contains no derivations or fitted predictions

The manuscript describes parsing, cleaning, pseudonymising and transcribing a raw Telegram archive into a structured dataset. No equations, parameters, predictions, or uniqueness claims appear. Validation of German origin is stated via linguistic and temporal markers with no reduction to self-defined inputs or self-citations. The central contribution is the data-processing pipeline itself, which is self-contained and externally verifiable by inspecting the released dataset. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset release paper describing data processing steps with no mathematical models, free parameters, axioms, or invented entities.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TeraGram: A Structured Longitudinal Dataset of the Telegram Messenger
physics.soc-ph · 2026-05 · unverdicted · novelty 6.0

A large-scale longitudinal dataset of public Telegram content is introduced to enable studies of engagement patterns and network evolution without algorithmic curation.

Infini-News: Efficiently Queryable Access to 1.3 Billion Processed Common Crawl News Articles
cs.CL · 2026-05 · unverdicted · novelty 5.0

Infini-News delivers a cleaned, enriched, and efficiently queryable index over the full CC-News archive with language and country attribution for 1.35 billion articles.