
arxiv: 1906.10551 · v1 · pith:2VVRT326new · submitted 2019-06-24 · 💻 cs.LG · cs.CL · stat.ML

Assessing the Applicability of Authorship Verification Methods

Pith reviewed 2026-05-25 17:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · stat.ML
keywords authorship verification · digital text forensics · applicability assessment · cross-topic verification · forensic corpora · text classification · machine learning evaluation

The pith

Some authorship verification methods handle short informal chats and time-separated documents, but all of them are vulnerable in cross-topic cases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to assess which existing authorship verification methods can be applied in real forensic investigations by first defining clear criteria and properties for characterizing such approaches. It then evaluates twelve methods, including current state-of-the-art ones, after training and optimization on three self-compiled corpora that each isolate a different practical challenge. The tests show some methods reaching 72.7 percent accuracy on 250-character informal chat conversations and over 75 percent accuracy on scientific documents separated by an average of 15.6 years. At the same time every method examined proves vulnerable when the documents come from different topics. This work matters because it supplies concrete performance data on applicability rather than leaving forensic practitioners to guess which techniques will hold up outside controlled research settings.

Core claim

The paper proposes explicit criteria and properties for characterizing AV approaches, then trains, optimizes and evaluates twelve existing methods on three self-compiled corpora, each targeting a distinct aspect of forensic applicability. Some of the methods succeed on very challenging cases, reaching 72.7 percent accuracy on 250-character informal chat conversations and over 75 percent accuracy on scientific documents written an average of 15.6 years apart. All twelve methods, however, are prone to failure in cross-topic verification cases.

What carries the argument

Three self-compiled corpora, each constructed to isolate one forensic applicability factor (short informal length, temporal separation, and topic variation), used to train, optimize, and test twelve AV methods.

If this is right

  • Methods that reach 72.7 percent on 250-character chats become candidates for forensic analysis of short informal messaging logs.
  • Methods exceeding 75 percent on documents separated by 15.6 years on average can be considered for cases involving writing produced years apart.
  • The universal weakness on cross-topic cases implies that any deployable AV system must incorporate safeguards or additional features that reduce topic dependence.
  • The proposed characterization criteria supply a repeatable basis for comparing new AV methods against the same forensic dimensions.
  • Forensic use of AV requires matching the chosen method to the expected document characteristics rather than treating all methods as interchangeable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If cross-topic failure is the dominant limitation, pairing AV methods with separate topic classifiers could improve reliability in mixed-topic investigations.
  • The same three-corpus testing design could be applied to non-English texts or additional genres to map method applicability more broadly.
  • The reported accuracies suggest AV outputs would function best as one piece of supporting evidence rather than decisive proof in legal settings.
  • Practitioners could adopt the paper's criteria as a checklist when selecting or developing AV tools for specific case types.

Load-bearing premise

The three self-compiled corpora accurately capture the distributions of document length, temporal separation, and topic variation that occur in real forensic verification tasks.

What would settle it

A new test showing that one or more of the twelve methods reaches high accuracy on cross-topic pairs drawn from the same three corpora, or external data showing that the corpora diverge substantially from actual forensic case distributions.

Figures

Figures reproduced from arXiv: 1906.10551 by Christian Winter, Lukas Graner, Oren Halvani.

Figure 1: The three possible model categories of authorship verification approaches. Here, U refers to the instance (for example, a document or a feature vector) of the unknown author. A is the target class (known author) and ¬A the outlier class (any other possible author). In the binary-intrinsic case, ρ denotes the verification problem (subject of classification), and Y and N denote the regions of the problem fea…
Figure 2: Evaluation results for the four versions of the test corpus CPerv in terms of c@1.

  AV Method    TP    FN    FP    TN    Total (Y/N/UP)
  GLAD         203   127   53    277   (256/404/0)
  Caravel      225   103   104   226   (329/329/2)
  Unmasking    158   169   56    272   (214/441/5)
  AVeer        56    274   6     324   (62/598/0)
  NNCD         40    290   3     327   (43/617/0)
  COAV         328   2     325   5     (653/7/0)
Figure 3: ROC curves for GLAD, Caravel and COAV (applied on the four corpora versions of CPerv). The circles and triangles depict the current and maximum achievable c@1 values on the corpus, respectively. Note that Caravel’s thresholds always lie along the EER-line.
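
The c@1 measure named in the Figure 2 caption rewards leaving a problem unanswered over answering it wrongly. The following is a minimal sketch, assuming the commonly used definition c@1 = (n_c + n_u · n_c / n) / n, where n is the number of problems, n_c the number of correct answers and n_u the number of unanswered problems. The function name and the dictionary layout are illustrative and not taken from the paper; the counts are the TP/FN/FP/TN/UP values listed with Figure 2.

```python
def c_at_1(tp, fn, fp, tn, unanswered=0):
    """Compute c@1 from confusion counts plus the number of unanswered problems.

    Assumes the usual definition: (n_c + n_u * n_c / n) / n.
    """
    n = tp + fn + fp + tn + unanswered
    if n == 0:
        raise ValueError("at least one verification problem is required")
    n_correct = tp + tn
    return (n_correct + unanswered * n_correct / n) / n


# Counts as listed with Figure 2: (TP, FN, FP, TN, UP).
counts = {
    "GLAD": (203, 127, 53, 277, 0),
    "Caravel": (225, 103, 104, 226, 2),
    "Unmasking": (158, 169, 56, 272, 5),
    "AVeer": (56, 274, 6, 324, 0),
    "NNCD": (40, 290, 3, 327, 0),
    "COAV": (328, 2, 325, 5, 0),
}

for method, values in counts.items():
    print(f"{method:10s} c@1 = {c_at_1(*values):.3f}")
```

With GLAD's counts, 480 of 660 problems are answered correctly and none are left unanswered, so c@1 reduces to plain accuracy, about 0.727. That coincides with the 72.7 percent quoted in the abstract, although this page does not state which corpus the counts belong to.
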
Original abstract

Authorship verification (AV) is a research subject in the field of digital text forensics that concerns itself with the question, whether two documents have been written by the same person. During the past two decades, an increasing number of proposed AV approaches can be observed. However, a closer look at the respective studies reveals that the underlying characteristics of these methods are rarely addressed, which raises doubts regarding their applicability in real forensic settings. The objective of this paper is to fill this gap by proposing clear criteria and properties that aim to improve the characterization of existing and future AV approaches. Based on these properties, we conduct three experiments using 12 existing AV approaches, including the current state of the art. The examined methods were trained, optimized and evaluated on three self-compiled corpora, where each corpus focuses on a different aspect of applicability. Our results indicate that part of the methods are able to cope with very challenging verification cases such as 250 characters long informal chat conversations (72.7% accuracy) or cases in which two scientific documents were written at different times with an average difference of 15.6 years (> 75% accuracy). However, we also identified that all involved methods are prone to cross-topic verification cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes criteria and properties for characterizing authorship verification (AV) methods and evaluates 12 existing approaches (including state-of-the-art) on three self-compiled corpora, each designed to target a distinct aspect of applicability in forensic settings. It reports that certain methods achieve 72.7% accuracy on 250-character informal chat conversations and >75% accuracy on scientific documents separated by an average of 15.6 years, while all methods prove vulnerable to cross-topic cases.

Significance. If the self-compiled corpora faithfully represent real forensic distributions, the results would usefully demonstrate that some AV methods can handle extreme length and temporal constraints while exposing a shared vulnerability to topic shifts, thereby guiding method selection and future development in digital text forensics.

major comments (2)
  1. [Corpus construction (Section 3)] The three self-compiled corpora are load-bearing for the applicability conclusions, yet no quantitative anchoring is supplied (topic entropy, length histograms, or temporal-gap statistics) against established forensic collections; without this, the reported accuracies may reflect residual topic leakage rather than authorship signal.
  2. [Results (Section 4 and abstract)] The specific accuracy figures (72.7% on the chat corpus, >75% on the temporal corpus) are presented without statistical significance tests, confidence intervals, or baseline comparisons, leaving the empirical support for the central claims only moderately robust.
minor comments (1)
  1. [Properties and criteria] The exact operational definition of 'cross-topic' pairs should be stated more explicitly when the properties are introduced, to ensure reproducibility of the failure case.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Point-by-point responses
  1. Referee: Corpus construction (Section 3): the three self-compiled corpora are load-bearing for the applicability conclusions, yet no quantitative anchoring is supplied (topic entropy, length histograms, or temporal-gap statistics) against established forensic collections; without this, the reported accuracies may reflect residual topic leakage rather than authorship signal.

    Authors: Our three corpora were constructed to isolate distinct forensic challenges (extreme brevity in informal text, long temporal gaps in scientific writing, and explicit cross-topic shifts) that are not simultaneously represented in standard collections such as PAN. The cross-topic corpus enforces topic separation by design, and the uniform failure of all 12 methods on it indicates that topic leakage does not explain the results on the other two corpora. We nevertheless agree that descriptive statistics for our own data would improve transparency; the revised manuscript will add length histograms and temporal-gap distributions to Section 3. Full quantitative anchoring (e.g., topic entropy) against external forensic datasets is not feasible without re-collecting or re-annotating those datasets under our criteria, but we will add a discussion of this limitation. revision: partial

  2. Referee: Results (Section 4 and abstract): the specific accuracy figures (72.7% on the chat corpus, >75% on the temporal corpus) are presented without statistical significance tests, confidence intervals, or baseline comparisons, leaving the empirical support for the central claims only moderately robust.

    Authors: We accept that the reported accuracies would be more robust with formal statistical support. The revised Section 4 will include bootstrap-derived 95% confidence intervals for all accuracy figures and pairwise significance tests (McNemar’s test) between methods. The evaluation already compares 12 methods that span simple baselines to the state of the art; we will additionally report an explicit random-guess baseline and a majority-class baseline to make this comparison explicit. revision: yes
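
The statistical additions promised in this response can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes a percentile bootstrap over per-problem correctness flags and the exact (binomial) form of McNemar's test on the discordant pairs of two methods. The function names, the resample count and the discordant counts in the usage lines are hypothetical; the 480-of-660 correctness vector simply mirrors GLAD's counts listed with Figure 2.

```python
import random
from math import comb


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    `correct` is a list of 0/1 flags, one per verification problem.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(correct)
    accuracies = []
    for _ in range(n_resamples):
        # Resample problems with replacement and recompute accuracy.
        resample = [correct[rng.randrange(n)] for _ in range(n)]
        accuracies.append(sum(resample) / n)
    accuracies.sort()
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: problems method A got right and method B got wrong.
    c: problems method A got wrong and method B got right.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Usage with illustrative inputs.
flags = [1] * 480 + [0] * 180          # 480 correct out of 660 problems
low, high = bootstrap_accuracy_ci(flags)
print(f"accuracy = {sum(flags) / len(flags):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")

# Hypothetical discordant counts between two methods.
print(f"McNemar exact p-value = {mcnemar_exact(30, 52):.4f}")
```

McNemar's test needs paired per-problem outcomes for the two methods being compared, which the aggregate counts on this page do not provide, so the discordant counts above are placeholders only.
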

Circularity Check

0 steps flagged

No circularity; empirical evaluation on held-out corpora

full rationale

The paper performs direct empirical measurement: 12 AV methods are trained, optimized and evaluated on three self-compiled corpora, with reported accuracies (e.g., 72.7 % on short chats, >75 % on temporal gaps) obtained from held-out test data. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the chain. The central claims rest on observable performance numbers rather than any reduction to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the untested assumption that the self-compiled corpora mirror real forensic distributions and that the proposed characterization properties are adequate; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The proposed criteria and properties suffice to characterize the applicability of AV methods in real forensic settings.
    The paper states that these properties aim to improve characterization of existing and future AV approaches.


