pith. machine review for the scientific record.

arxiv: 2604.16376 · v1 · submitted 2026-03-24 · 💻 cs.CL · cs.CR

Recognition: no theorem link

Foundational Study on Authorship Attribution of Japanese Web Reviews for Actor Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 00:55 UTC · model grok-4.3

classification 💻 cs.CL cs.CR
keywords authorship attribution · Japanese web reviews · BERT fine-tuning · TF-IDF logistic regression · threat intelligence · actor analysis · stylistic features · multi-author classification

The pith

TF-IDF with logistic regression outperforms fine-tuned BERT for Japanese authorship attribution once author counts reach several hundred.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether stylistic features from clear-web Japanese reviews can support author identification, as a step toward analyzing dark-web forums for threat intelligence. Experiments on Rakuten Ichiba review data compare TF-IDF plus logistic regression, BERT embeddings with logistic regression, BERT fine-tuning, and metric learning with nearest neighbors. BERT fine-tuning reaches the highest accuracy while the number of authors stays small, yet training grows unstable beyond a few hundred authors. At that scale TF-IDF plus logistic regression delivers better accuracy, greater stability, and lower computational cost, while top-k ranking aids candidate screening. Error analysis shows that boilerplate text, topic shifts, and short texts drive most misclassifications.
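The TF-IDF+LR baseline the paper favors at scale is a few lines of off-the-shelf code. A minimal, illustrative sketch follows; the toy English texts and author labels are stand-ins, not the Rakuten data, and character n-grams are used here because they are a common segmentation-free choice for Japanese:

```python
# Minimal sketch of a TF-IDF + logistic regression author classifier.
# Toy data stands in for the Rakuten corpus; character n-grams
# (analyzer="char") sidestep word segmentation, as is common for Japanese.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "fast shipping great product will buy again",
    "great seller fast shipping thanks",
    "the novel explores memory and loss in quiet prose",
    "a quiet meditative novel about memory",
]
authors = ["author_a", "author_a", "author_b", "author_b"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 3)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, authors)
print(model.predict(["quiet prose about memory and loss"])[0])
```

Because the pipeline exposes `predict_proba`, the same model yields ranked author scores for the top-k screening the paper evaluates.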

Core claim

BERT fine-tuning achieves the best performance on Japanese review authorship attribution for limited author sets, yet becomes unstable as the number of authors scales to several hundred, at which point TF-IDF combined with logistic regression proves superior in accuracy, stability, and computational cost. Top-k evaluation confirms the value of candidate screening, and error analysis identifies boilerplate text, topic dependency, and short text length as the main causes of misclassification. The study positions these results as a foundational benchmark for future application to dark web forum posts in actor analysis.

What carries the argument

Comparison of TF-IDF+LR, BERT-Emb+LR, BERT-FT, and Metric+kNN on multi-author classification using stylistic features from Japanese web reviews

If this is right

  • BERT fine-tuning yields peak accuracy when distinguishing among a small number of authors.
  • TF-IDF with logistic regression becomes preferable once author counts reach several hundred due to stability and efficiency.
  • Top-k candidate ranking supports practical screening in authorship tasks.
  • Boilerplate text and topic-dependent content are primary sources of misclassification.
  • Short text lengths increase error rates in style-based attribution.
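The top-k candidate ranking in the bullets above reduces to checking whether the true author appears among the k highest-scoring candidates. A minimal sketch, with hypothetical scores rather than the paper's model outputs:

```python
# Top-k accuracy: fraction of instances whose true author appears
# among the k highest-scoring candidates.
def top_k_accuracy(scores, true_labels, k):
    """scores: one {author: score} dict per instance."""
    hits = 0
    for row, truth in zip(scores, true_labels):
        ranked = sorted(row, key=row.get, reverse=True)[:k]
        hits += truth in ranked
    return hits / len(true_labels)

# Hypothetical scores for three instances over three authors.
scores = [
    {"a": 0.7, "b": 0.2, "c": 0.1},
    {"a": 0.4, "b": 0.5, "c": 0.1},
    {"a": 0.1, "b": 0.3, "c": 0.6},
]
truth = ["a", "a", "c"]
print(top_k_accuracy(scores, truth, k=1))  # 2/3: instance 2 misses at k=1
print(top_k_accuracy(scores, truth, k=2))  # 1.0: "a" ranks second there
```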

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the clear-web models transfer with modest loss, they could enable initial actor screening in large-scale threat intelligence monitoring.
  • Domain-adaptation methods may be required to bridge differences between review prose and forum language.
  • The efficiency of TF-IDF+LR suggests it could run as a first-pass filter on high-volume text streams.
  • Collecting longer and more varied Japanese texts could reduce topic-dependency errors in follow-on work.

Load-bearing premise

Stylistic features extracted from clear-web Japanese reviews will transfer to dark-web forum posts without major domain shift in language or topic.

What would settle it

Training the models on Rakuten reviews then testing them on a held-out set of dark-web Japanese forum posts from known authors and observing a large accuracy drop would falsify reliable transfer.

Figures

Figures reproduced from arXiv: 2604.16376 by Hiroshi Matsubara, Masaki Hashimoto, Shingo Matsugaya, Taichi Aoki.

Figure 1. Overview of the evaluation pipeline.
Figure 2. Experiment 1: Relationship between reviews per author.
Figure 3. Experiment 1: Relationship between reviews per author.
Figure 4. Relationship between k and Top-k performance at U = 100.
Figure 5. Experiment 3 (small scale): U = 2–100.
Figure 7. Experiment 3 (small scale): U = 2–100, log scale.
Original abstract

This study investigates the applicability of authorship attribution based on stylistic features to support actor analysis in threat intelligence. As a foundational step toward future application to dark web forums, we conducted experiments using Japanese review data from clear web sources. We constructed datasets from Rakuten Ichiba reviews and compared four methods: TF-IDF with logistic regression (TF-IDF+LR), BERT embeddings with logistic regression (BERT-Emb+LR), BERT fine-tuning (BERT-FT), and metric learning with $k$-nearest neighbors (Metric+kNN). Results showed that BERT-FT achieved the best performance; however, training became unstable as the number of authors scaled to several hundred, where TF-IDF+LR proved superior in terms of accuracy, stability, and computational cost. Furthermore, Top-$k$ evaluation demonstrated the utility of candidate screening, and error analysis revealed that boilerplate text, topic dependency, and short text length were primary factors causing misclassification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical comparison of four authorship attribution methods—TF-IDF with logistic regression, BERT embeddings with logistic regression, BERT fine-tuning, and metric learning with kNN—on Japanese product reviews from the Rakuten Ichiba clear-web corpus. It finds that BERT fine-tuning yields the highest performance at small author counts but becomes unstable beyond a few hundred authors, at which point TF-IDF+LR is superior in accuracy, stability, and computational cost. Top-k evaluation is shown to be useful for candidate screening, and error analysis identifies boilerplate text, topic dependence, and short review length as primary sources of misclassification. The work is positioned as a foundational study toward applying stylistic authorship attribution to dark-web forums for actor analysis in threat intelligence.

Significance. If the observed scaling behavior and error factors generalize, the results supply concrete guidance on method selection for Japanese authorship attribution tasks at varying author-set sizes. The explicit identification of practical failure modes (boilerplate, topic, length) is a useful contribution for future system design. However, because no domain-shift experiments, feature-distribution comparisons, or dark-web data are included, the significance for the stated threat-intelligence use case remains conditional on an untested transfer assumption.

major comments (2)
  1. [Abstract and §1] The central framing that the Rakuten Ichiba experiments constitute a 'foundational step toward future application to dark web forums' for actor analysis rests on the unverified assumption that stylistic signals will survive domain differences in register, topic distribution, anonymity, and boilerplate. No cross-domain evaluation, zero-shot transfer test, synthetic domain-shift simulation, or stylistic-feature comparison between clear-web reviews and dark-web text is reported, making the applicability claim load-bearing yet unsupported.
  2. [Results, presumably §4] The comparative claims (BERT-FT superiority at small scale, instability beyond several hundred authors, TF-IDF+LR superiority at scale) are presented without exact accuracy numbers, dataset statistics (number of authors, reviews per author, total samples), statistical significance tests, or ablation details. These omissions prevent quantitative assessment of the scaling threshold and undermine the strength of the stability and efficiency conclusions.
minor comments (2)
  1. [Abstract] Method names (TF-IDF+LR, BERT-FT, etc.) should be defined at first use in the main text and used consistently; the abstract introduces them without prior definition.
  2. [Results] The paper should report the precise author-count threshold at which BERT-FT training instability begins and the corresponding accuracy values for all four methods at that scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below. Where the comments identify gaps in quantitative detail or overly strong framing, we have revised the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and §1] The central framing that the Rakuten Ichiba experiments constitute a 'foundational step toward future application to dark web forums' for actor analysis rests on the unverified assumption that stylistic signals will survive domain differences in register, topic distribution, anonymity, and boilerplate. No cross-domain evaluation, zero-shot transfer test, synthetic domain-shift simulation, or stylistic-feature comparison between clear-web reviews and dark-web text is reported, making the applicability claim load-bearing yet unsupported.

    Authors: We accept that the original wording risked implying direct transferability. The Rakuten experiments were designed as a controlled study of Japanese stylistic attribution on clear-web product reviews; the dark-web motivation is aspirational and not a claim of immediate applicability. In the revised manuscript we have rewritten the abstract and Section 1 to state explicitly that (i) all reported results are specific to the Rakuten domain, (ii) domain-shift robustness remains untested, and (iii) future work will be required to assess transfer to dark-web registers. This removes the unsupported load-bearing claim while preserving the study’s stated motivation. revision: yes

  2. Referee: [Results, presumably §4] The comparative claims (BERT-FT superiority at small scale, instability beyond several hundred authors, TF-IDF+LR superiority at scale) are presented without exact accuracy numbers, dataset statistics (number of authors, reviews per author, total samples), statistical significance tests, or ablation details. These omissions prevent quantitative assessment of the scaling threshold and undermine the strength of the stability and efficiency conclusions.

    Authors: We agree that the original results section lacked sufficient granularity. The revised version now includes: (a) a table reporting exact top-1 and top-5 accuracies for each method at author-set sizes 10, 50, 100, 200, 500 and 1000; (b) full dataset statistics (number of authors, mean/median reviews per author, total tokens); (c) McNemar’s test p-values comparing TF-IDF+LR against BERT-FT at each scale; and (d) an ablation table isolating the effects of review length and boilerplate removal. These additions allow readers to locate the precise scaling threshold and to assess the stability and efficiency claims quantitatively. revision: yes
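The McNemar comparison the rebuttal promises has a compact exact form for paired classifier predictions. A sketch in pure Python, using illustrative disagreement counts rather than the paper's:

```python
# Exact McNemar's test on paired predictions:
# b = instances only method 1 classifies correctly,
# c = instances only method 2 classifies correctly.
# Under H0 (no difference), the b-vs-c split is Binomial(b + c, 0.5).
from math import comb

def mcnemar_exact(b, c):
    n = b + c
    if n == 0:
        return 1.0  # no disagreements, nothing to test
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)  # two-sided exact p-value, capped at 1

# Hypothetical disagreements between TF-IDF+LR and BERT-FT at one scale.
print(mcnemar_exact(2, 8))  # 0.109375: not significant at this toy size
```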

Circularity Check

0 steps flagged

No circularity; standard empirical benchmark of off-the-shelf methods

full rationale

The paper performs a direct empirical comparison of four established authorship attribution techniques (TF-IDF+LR, BERT-Emb+LR, BERT-FT, Metric+kNN) on held-out splits of Rakuten Ichiba review data. All reported accuracies, stability observations, and error analyses derive from standard train/test evaluation on external text without any parameter fitting that is later relabeled as a prediction, without self-citation chains supporting uniqueness theorems, and without equations or derivations that reduce to their own inputs by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on standard supervised classification assumptions plus the domain transfer premise; no new entities or parameters invented beyond typical model hyperparameters.

free parameters (1)
  • number of authors threshold
    Point at which BERT-FT instability appears, determined experimentally rather than derived.
axioms (1)
  • domain assumption: Stylistic features remain consistent enough across clear-web and dark-web Japanese text for attribution to transfer.
    Invoked to justify the foundational study as preparation for dark-web application.

pith-pipeline@v0.9.0 · 5467 in / 1113 out tokens · 32804 ms · 2026-05-15T00:55:00.235738+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

  1. [1] Mike Kestemont, Efstathios Stamatatos, Enrique Manjavacas, Walter Daelemans, Martin Potthast, and Benno Stein. Overview of the author identification task at PAN-2018: Cross-domain authorship attribution and style change detection. In Working Notes of CLEF 2018.

  2. [2] Javier Huertas-Tato, Alejandro Martín, and David Camacho. PART: Pre-trained authorship representation transformer. arXiv preprint arXiv:2209.15373.

  3. [3] (dataset). https://doi.org/10.32130/idr.2.1.