HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction
Pith reviewed 2026-05-23 23:09 UTC · model grok-4.3
The pith
Keeping documents as sequences of vectors and scoring them with late interaction, at token or patch level, outperforms single-vector baselines for authorship attribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Authorship attribution improves when documents are kept as sequences of vectors and compared with late interaction or its patch-level variant, rather than compressed to single vectors, on a contrastive corpus that draws same-author examples from separate papers to limit topical signals.
What carries the argument
Patch-Level Late Interaction (PLI), which compresses neighboring tokens into patches before performing sequence matching between documents.
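To make the three scoring schemes concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the mean pooling used to build patches, the ColBERT-style sum of maximum cosine similarities, and every function name are assumptions made for illustration, since the text reviewed here does not specify these details.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_vector_score(doc_a, doc_b):
    """Baseline: compress each document to one vector, then take the cosine."""
    return dot(normalize(mean_pool(doc_a)), normalize(mean_pool(doc_b)))

def late_interaction(doc_a, doc_b):
    """For each vector in doc_a, keep its best cosine match in doc_b; sum the maxima."""
    a = [normalize(v) for v in doc_a]
    b = [normalize(v) for v in doc_b]
    return sum(max(dot(u, v) for v in b) for u in a)

def to_patches(doc, patch_size):
    """Group neighboring token vectors into patches (assumed: mean pooling)."""
    return [mean_pool(doc[i:i + patch_size]) for i in range(0, len(doc), patch_size)]

def patch_level_late_interaction(doc_a, doc_b, patch_size=2):
    """Late interaction computed over patches instead of individual tokens."""
    return late_interaction(to_patches(doc_a, patch_size), to_patches(doc_b, patch_size))

# Toy documents: each row stands in for one token embedding.
doc_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_b = [[0.9, 0.1], [0.2, 0.8]]
print(single_vector_score(doc_a, doc_b))
print(late_interaction(doc_a, doc_b))
print(patch_level_late_interaction(doc_a, doc_b, patch_size=2))
```

With `patch_size=1` the patch-level score reduces to ordinary token-level late interaction, so in this sketch the patch size is the knob that controls interaction granularity.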
Load-bearing premise
Drawing same-author passages from distinct papers within a field keeps topical overlap low enough that measured gains reflect style rather than content similarity.
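The pairing rule behind this premise can be sketched as follows. This is an illustration only: the record fields (`author`, `field`, `paper_id`, `text`), the one-author-per-passage simplification, and the choice to draw different-author pairs from the same field are all assumptions, not details confirmed by the text reviewed here.

```python
from collections import defaultdict
from itertools import combinations

def same_author_pairs(passages):
    """Yield pairs sharing author and field but coming from different papers."""
    by_author_field = defaultdict(list)
    for p in passages:
        by_author_field[(p["author"], p["field"])].append(p)
    for group in by_author_field.values():
        for p, q in combinations(group, 2):
            if p["paper_id"] != q["paper_id"]:  # never pair within one paper
                yield p, q

def different_author_pairs(passages):
    """Yield pairs from the same field written by different authors (assumed negatives)."""
    by_field = defaultdict(list)
    for p in passages:
        by_field[p["field"]].append(p)
    for group in by_field.values():
        for p, q in combinations(group, 2):
            if p["author"] != q["author"]:
                yield p, q

# Hypothetical toy records; real passages would carry actual text.
passages = [
    {"author": "A", "field": "cs", "paper_id": "p1", "text": "first passage"},
    {"author": "A", "field": "cs", "paper_id": "p1", "text": "second passage"},
    {"author": "A", "field": "cs", "paper_id": "p2", "text": "third passage"},
    {"author": "B", "field": "cs", "paper_id": "p3", "text": "fourth passage"},
    {"author": "A", "field": "math", "paper_id": "p4", "text": "fifth passage"},
]
positives = list(same_author_pairs(passages))
negatives = list(different_author_pairs(passages))
print(len(positives), len(negatives))
```

The two passages from paper `p1` are never paired with each other, which is the mechanism the premise relies on to keep topical overlap low.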
What would settle it
Retraining and testing the same models on a version of the corpus where same-author passages are drawn from the same paper instead of different papers would show whether the reported gains depend on the contrastive separation of topic and author.
Original abstract
Authorship attribution asks whether two pieces of text share a writer, but topical confound makes the task deceptively easy: two authors covering the same topic may look more alike than one author covering two topics. Scholarly prose offers a natural remedy, academic writers produce multiple papers on related but distinct topics while maintaining consistent stylistic habits. We introduce HALvest, a 17-billion-token multilingual corpus of open-access academic papers, and its English contrastive derivative HALvest-Contrastive, where same-author passages are drawn from distinct papers within a disciplinary field to minimize topical overlap. We validate our benchmark by showing that a strong lexical baseline collapses once topical shortcuts are removed. On this same benchmark, we revisit how authorship is scored. Standard systems compress each document into a single vector. We instead keep a sequence of vectors and compare them with late interaction, then propose patch-level late interaction, which groups neighboring tokens into patches before matching. Matching at the sequence level greatly improves performance over the single-vector baseline, but the optimal interaction granularity is subtle.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces HALvest, a 17-billion-token multilingual corpus of open-access scholarly papers, and HALvest-Contrastive, its English contrastive derivative, in which same-author passages are drawn from distinct papers within the same field to reduce topical overlap. It argues that single-vector document representations are insufficient for authorship attribution. Instead, it retains sequences of vectors, compares them via late interaction, and introduces Patch-Level Late Interaction (PLI), which groups neighboring tokens into patches before matching. The central empirical claim is that sequence-level matching substantially outperforms the single-vector baseline, although the optimal interaction granularity is subtle.
Significance. If the performance gains hold after adequate control for topical signals and the corpus is released, the work supplies both a large-scale resource for authorship and stylometry research and concrete evidence that retrieval-style late interaction can better separate style from topic in scholarly writing. The scale of the corpus and the contrastive construction are clear strengths.
major comments (2)
- [HALvest-Contrastive construction] HALvest-Contrastive construction: the claim that drawing same-author passages from distinct papers within a field sufficiently minimizes topical overlap is not supported by any explicit verification (e.g., cosine similarity or topic-model overlap between same-author vs. different-author pairs). This is load-bearing because the central claim attributes measured gains to stylistic capture rather than residual topical signals.
- [Evaluation] Evaluation section: the abstract states that sequence-level matching 'greatly improves performance' over the single-vector baseline, yet no quantitative results, baselines, metrics, or statistical tests are supplied in the provided text, preventing assessment of effect size or robustness.
minor comments (1)
- [Abstract] Abstract: 'PLI' is introduced without spelling out 'Patch-Level Late Interaction' on first use.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments identify key areas where additional evidence and detail will strengthen the manuscript. We address each point below and will revise accordingly.
- Referee: [HALvest-Contrastive construction] HALvest-Contrastive construction: the claim that drawing same-author passages from distinct papers within a field sufficiently minimizes topical overlap is not supported by any explicit verification (e.g., cosine similarity or topic-model overlap between same-author vs. different-author pairs). This is load-bearing because the central claim attributes measured gains to stylistic capture rather than residual topical signals.
  Authors: We agree that explicit verification is needed to support this load-bearing claim. The current manuscript does not include such analysis. In the revision we will add quantitative checks, including average cosine similarities of embeddings and topic-model overlap statistics, comparing same-author versus different-author pairs within HALvest-Contrastive.
  Revision: yes
- Referee: [Evaluation] Evaluation section: the abstract states that sequence-level matching 'greatly improves performance' over the single-vector baseline, yet no quantitative results, baselines, metrics, or statistical tests are supplied in the provided text, preventing assessment of effect size or robustness.
  Authors: We acknowledge the omission. The provided text does not contain the detailed quantitative results. We will expand the Evaluation section in the revised manuscript to report the specific performance numbers, baselines, metrics, effect sizes, and statistical tests that support the abstract claim.
  Revision: yes
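The topical-overlap check promised in the first response could look roughly like the sketch below. It is an assumption-laden illustration, not the authors' analysis: a bag-of-words cosine stands in for the embedding or topic-model similarity they mention, and the function names are hypothetical. Any similarity function taking two texts can be passed in through `sim`.

```python
import math
from collections import Counter
from statistics import mean

def bow_cosine(text_a, text_b):
    """Cosine similarity between whitespace-tokenized word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    shared = a.keys() & b.keys()
    dot = sum(a[w] * b[w] for w in shared)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def topical_overlap_gap(same_author_texts, diff_author_texts, sim=bow_cosine):
    """Return (mean same-author similarity, mean different-author similarity, gap)."""
    same = mean(sim(a, b) for a, b in same_author_texts)
    diff = mean(sim(a, b) for a, b in diff_author_texts)
    return same, diff, same - diff

# Hypothetical toy pairs of passage texts.
same_pairs = [("late interaction for retrieval", "style features in scholarly prose")]
diff_pairs = [("late interaction for retrieval", "dense retrieval with late interaction")]
print(topical_overlap_gap(same_pairs, diff_pairs))
```

A gap close to zero, or negative, would suggest that same-author pairs are no more topically similar than different-author pairs under the chosen similarity; a large positive gap would indicate residual topical signal.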
Circularity Check
No circularity: empirical corpus construction and evaluation are self-contained.
The paper constructs HALvest-Contrastive by drawing same-author passages from distinct papers within fields and evaluates sequence-level late interaction (including PLI) against single-vector baselines on this corpus. No equations, fitted parameters, or self-citations reduce the reported performance gains to quantities defined by the inputs themselves. The central claims rest on measured improvements on held-out data, which are externally falsifiable and independent of any internal redefinition or prediction-by-construction. This is the standard case of an empirical study with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Same-author passages drawn from distinct papers within a field minimize topical overlap enough for attribution gains to reflect style.