GerAV: Towards New Heights in German Authorship Verification using Fine-Tuned LLMs on a New Benchmark

Christoph Leiter; Elena Schmidt; Lotta Kiefer; Sotaro Takeshita; Steffen Eger

arxiv: 2601.13711 · v2 · submitted 2026-01-20 · 💻 cs.CL


Pith reviewed 2026-05-16 13:04 UTC · model grok-4.3

classification 💻 cs.CL

keywords: authorship verification, German, large language models, benchmark, fine-tuning, Twitter, Reddit, cross-domain

The pith

A fine-tuned large language model outperforms recent baselines by up to 0.09 absolute F1 on German authorship verification, using a new benchmark of over 400,000 text pairs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GerAV, a large benchmark of German text pairs drawn from Twitter and Reddit for the task of authorship verification. It evaluates strong baselines and state-of-the-art models on splits that isolate effects of data source, domain, and length. The central finding is that fine-tuning a large language model on the provided training data yields the highest scores, exceeding both prior methods and zero-shot GPT-5. The work also documents a specialization-generalization trade-off: models excel when test conditions match their training source but lose ground across different regimes, with mixing sources offering partial mitigation. This establishes a controlled testbed for non-English authorship verification that was previously lacking.

Core claim

GerAV supplies over 400,000 labeled German text pairs, split into Twitter, Reddit in-domain, cross-domain, and profile-based subsets. Fine-tuned large language models trained on these splits achieve up to 0.09 higher absolute F1 than recent baselines and 0.08 higher than GPT-5 in zero-shot mode. Models trained on one data regime perform best under matching conditions yet generalize less well to other regimes; combining multiple training sources reduces this gap while preserving high matched-condition accuracy.
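The "absolute F1" deltas quoted here are plain differences between two F1 scores. As a minimal sketch, assuming binary same-author labels (1 = same author) and the standard positive-class F1, the arithmetic looks like the following. The labels and predictions are invented toy values, and the paper's exact F1 variant (for example macro-averaged) is not stated in this summary.

```python
def f1_score(y_true, y_pred, positive=1):
    """Positive-class F1 for binary same-author decisions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correctly matched
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical, not from GerAV): six pairs, 1 = same author.
labels = [1, 1, 1, 0, 0, 0]
baseline_preds = [1, 0, 0, 0, 0, 1]
finetuned_preds = [1, 1, 0, 0, 0, 1]

# An "absolute" gain is the simple difference of the two scores.
absolute_gain = f1_score(labels, finetuned_preds) - f1_score(labels, baseline_preds)
print(round(absolute_gain, 4))
```

An absolute gain of 0.09 therefore means, for instance, moving from 0.80 to 0.89 F1, not a 9% relative improvement.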

What carries the argument

The GerAV benchmark of Twitter and Reddit text pairs, with controlled subsets for source, domain, and length, serving as training and test data for fine-tuned large language models in authorship verification.
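To make "labeled text pairs" concrete, here is a minimal sketch of how same-author and different-author pairs can be built from per-author text collections. This is an illustrative assumption, not the paper's construction procedure: the function name `build_pairs`, the balanced one-negative-per-positive sampling, and the toy data are all hypothetical.

```python
import itertools
import random


def build_pairs(texts_by_author, seed=0):
    """Return (text_a, text_b, label) triples; label 1 = same author."""
    if len(texts_by_author) < 2:
        raise ValueError("need at least two authors to form negative pairs")
    rng = random.Random(seed)

    # Positive pairs: every unordered pair of texts by the same author.
    positives = [
        (a, b, 1)
        for texts in texts_by_author.values()
        for a, b in itertools.combinations(texts, 2)
    ]

    # Negative pairs: sample two distinct authors, one text from each.
    # One negative per positive keeps the classes balanced (duplicates possible).
    authors = list(texts_by_author)
    negatives = []
    while len(negatives) < len(positives):
        first, second = rng.sample(authors, 2)
        negatives.append(
            (rng.choice(texts_by_author[first]), rng.choice(texts_by_author[second]), 0)
        )
    return positives + negatives


# Hypothetical toy corpus: two authors with two posts each.
toy_corpus = {"anna": ["a1", "a2"], "ben": ["b1", "b2"]}
toy_pairs = build_pairs(toy_corpus)
```

A real pipeline would additionally control for topic, subreddit, and text length when sampling, which is what the benchmark's subsets are meant to support.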

If this is right

Models achieve peak accuracy only when training and test data share the same source and domain.
Mixing training sources from Twitter and Reddit improves cross-regime generalization.
The benchmark enables separate measurement of how text length and topical domain affect verification difficulty.
Fine-tuning outperforms zero-shot use of GPT-5 by 0.08 F1 on this German data.
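The matched-versus-mismatched comparison behind these points can be sketched as a train-regime by test-regime score matrix. The code below is a toy illustration under stated assumptions: the two "models" are hypothetical one-line heuristics standing in for fine-tuned systems, the test pairs are invented, and accuracy is used instead of F1 for brevity.

```python
def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluation_matrix(models, test_sets, metric=accuracy):
    """Score every model (keyed by training regime) on every test regime."""
    matrix = {}
    for train_regime, predict in models.items():
        for test_regime, (pairs, labels) in test_sets.items():
            preds = [predict(a, b) for a, b in pairs]
            matrix[(train_regime, test_regime)] = metric(labels, preds)
    return matrix


def generalization_gap(matrix, regime):
    """Matched-condition score minus the mean score on all other regimes."""
    matched = matrix[(regime, regime)]
    others = [s for (tr, te), s in matrix.items() if tr == regime and te != regime]
    return matched - sum(others) / len(others)


# Hypothetical stand-ins: each "model" relies on a cue typical of its regime.
models = {
    "twitter": lambda a, b: int(a[0] == b[0]),      # same leading symbol
    "reddit": lambda a, b: int(len(a) == len(b)),   # same text length
}
test_sets = {
    "twitter": ([("#a x", "#a y"), ("#a x", "@b y")], [1, 0]),
    "reddit": ([("long text", "long post"), ("long text", "later")], [1, 0]),
}

matrix = evaluation_matrix(models, test_sets)
twitter_gap = generalization_gap(matrix, "twitter")
```

In this toy setup each heuristic is perfect on its own regime and at chance on the other, which mirrors the shape of the specialization-generalization trade-off, not its measured size.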

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same data-construction approach could be replicated for other languages using local social-media corpora to create comparable benchmarks.
The observed specialization-generalization trade-off suggests that future systems may need explicit multi-domain training objectives rather than single-source fine-tuning.
If platform artifacts prove influential, the benchmark could be used to quantify how much social-media style differs from formal writing in German.

Load-bearing premise

The Twitter and Reddit text pairs accurately reflect real-world German authorship style without significant platform-specific artifacts or labeling errors.

What would settle it

Performance of the fine-tuned models dropping below baseline levels when evaluated on a held-out set of verified German literary or journalistic texts from non-social-media sources.

Original abstract

Authorship verification (AV) is the task of determining whether two texts were written by the same author and has been studied extensively, predominantly for English data. In contrast, large-scale benchmarks and systematic evaluations for other languages remain scarce. We address this gap by introducing GerAV, a comprehensive benchmark for German AV comprising over 400k labeled text pairs. GerAV is built from Twitter and Reddit data, with the Reddit part further divided into in-domain and cross-domain message-based subsets, as well as a profile-based subset. This design enables controlled analysis of the effects of data source, topical domain, and text length. Using the provided training splits, we conduct a systematic evaluation of strong baselines and state-of-the-art models and find that our best approach, a fine-tuned large language model, outperforms recent baselines by up to 0.09 absolute F1 score and surpasses GPT-5 in a zero-shot setting by 0.08. We further observe a trade-off between specialization and generalization: models trained on specific data types perform best under matching conditions but generalize less well across data regimes, a limitation that can be mitigated by combining training sources. Overall, GerAV provides a challenging and versatile benchmark for advancing research on German and cross-domain AV. Our code and information about data access are available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces GerAV, a new benchmark for German authorship verification consisting of over 400k labeled text pairs drawn from Twitter and Reddit sources. It defines in-domain, cross-domain, and profile-based subsets to enable controlled analysis of data source, topical domain, and text length effects. Systematic experiments show that a fine-tuned LLM outperforms recent baselines by up to 0.09 absolute F1 and GPT-5 zero-shot by 0.08, while documenting a specialization-generalization trade-off that improves when training sources are combined. Code and data access details are released on GitHub.

Significance. If the benchmark validity holds, GerAV fills a clear gap in non-English AV resources and supplies the first large-scale, multi-regime testbed for German. The reported F1 gains from fine-tuning demonstrate that LLM specialization can deliver concrete improvements over both traditional baselines and large zero-shot models, with the cross-domain results offering a template for handling domain shift in authorship tasks more broadly.

major comments (1)

[§3] §3 (Benchmark Construction): The headline 0.09 F1 and 0.08 GPT-5 gains rest on the assumption that Twitter/Reddit pairs isolate authorship style. The construction uses platform-specific signals (character limits, hashtags, subreddit topics, posting patterns) without reported ablations for length normalization, emoji removal, or adversarial topic controls. Cross-domain splits mitigate some risk but do not substitute for explicit isolation tests; this directly affects whether the deltas reflect genuine AV progress or artifact exploitation.

minor comments (2)

[Abstract] Abstract and §5: The phrase 'up to 0.09' is not tied to a specific baseline or regime; adding the exact comparison (e.g., 'vs. X on the profile-based split') would improve clarity.
[§4] §4 (Experimental Setup): Training details for the fine-tuned LLM (learning rate, epochs, prompt template) are referenced but not fully tabulated; a concise hyper-parameter table would aid reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the concern regarding benchmark construction and potential platform artifacts below, and we will incorporate additional validation steps in the revision.

Point-by-point responses

Referee: [§3] §3 (Benchmark Construction): The headline 0.09 F1 and 0.08 GPT-5 gains rest on the assumption that Twitter/Reddit pairs isolate authorship style. The construction uses platform-specific signals (character limits, hashtags, subreddit topics, posting patterns) without reported ablations for length normalization, emoji removal, or adversarial topic controls. Cross-domain splits mitigate some risk but do not substitute for explicit isolation tests; this directly affects whether the deltas reflect genuine AV progress or artifact exploitation.

Authors: We appreciate the referee's emphasis on isolating authorship style from platform-specific signals. GerAV's design incorporates several controls to address this: the profile-based subset draws multiple texts from the same author across varied contexts to reduce topical confounding; cross-domain splits explicitly evaluate generalization across topical domains and data sources; and subsets are stratified by text length to enable analysis of length effects. We agree that explicit ablations for length normalization, emoji removal, and adversarial topic controls were not reported. To strengthen the claim that performance gains reflect authorship verification rather than artifacts, we will add these experiments in the revised manuscript, including preprocessing variants and corresponding F1 results on the in-domain, cross-domain, and profile-based splits. This will provide direct evidence on the contribution of each factor.

Revision: yes
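The preprocessing variants discussed in this exchange could look roughly like the sketch below. It is an assumption about what such ablations might involve, not code from the paper: the variant names, the 20-word cap, and the emoji character ranges (a coarse approximation that does not cover all emoji) are hypothetical choices.

```python
import re

# Coarse emoji approximation: common pictograph and symbol blocks only.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
TAG_RE = re.compile(r"[#@]\w+")   # hashtags and @-mentions
WS_RE = re.compile(r"\s+")


def strip_emoji(text):
    return EMOJI_RE.sub("", text)


def strip_tags(text):
    return TAG_RE.sub("", text)


def truncate_words(text, max_words=20):
    """Crude length normalization: keep only the first max_words tokens."""
    return " ".join(text.split()[:max_words])


VARIANTS = {
    "raw": lambda t: t,
    "no_emoji": strip_emoji,
    "no_tags": strip_tags,
    "truncated": truncate_words,
}


def apply_variant(name, text):
    """Apply one ablation variant, then collapse leftover whitespace."""
    return WS_RE.sub(" ", VARIANTS[name](text)).strip()


example = "Guten Morgen 😀 #Berlin"
cleaned = {name: apply_variant(name, example) for name in VARIANTS}
```

Re-running the same evaluation on each variant and comparing the resulting F1 scores would indicate how much a model leans on emoji, tags, or length rather than on writing style.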

Circularity Check

0 steps flagged

No circularity: empirical benchmark creation and model evaluation

Full rationale

The paper introduces GerAV as a new benchmark from Twitter and Reddit data and reports empirical F1 scores from fine-tuned LLMs versus baselines and GPT-5 zero-shot. No equations, derivations, fitted parameters renamed as predictions, or self-referential steps exist. Performance deltas are measured on held-out splits of the newly constructed dataset; the benchmark construction and evaluation do not reduce to any input by construction. Self-citations, if present, are not load-bearing for any claimed result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on standard assumptions about social-media text reflecting author style and on the correctness of the provided labels; no free parameters are fitted to produce the reported F1 numbers, and no new entities are postulated.

axioms (1)

Domain assumption: Social media posts from the same user share detectable stylistic features across topics and platforms.
Invoked when constructing in-domain versus cross-domain splits and when claiming generalization limits.

pith-pipeline@v0.9.0 · 5552 in / 1244 out tokens · 45775 ms · 2026-05-16T13:04:43.747084+00:00 · methodology
