HALvest-Contrastive: Retrieval-Like Authorship Attribution with Patch-Level Late Interaction

Florian Cafiero; Francis Kulumba; Guillaume Vimont; Laurent Romary; Wissam Antoun

arxiv: 2407.20595 · v5 · pith:L3LVS4YNnew · submitted 2024-07-30 · 💻 cs.DL · cs.CL

Pith reviewed 2026-05-23 23:09 UTC · model grok-4.3

keywords: authorship attribution, late interaction, contrastive corpus, patch-level matching, scholarly papers, topical confound, retrieval methods, multilingual corpus

The pith

Sequence-level matching with patch-level late interaction outperforms single-vector baselines for authorship attribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a 17-billion-token multilingual corpus of scholarly papers and derives an English contrastive version where same-author passages come from distinct papers in the same field. This construction reduces the chance that models exploit topical overlap when deciding whether two texts share an author. Documents are represented not as single vectors but as sequences of vectors, which are compared through late interaction; a new Patch-Level Late Interaction variant first collapses neighboring tokens into patches before matching. Sequence-level comparison raises accuracy over the single-vector approach, although the best interaction granularity varies with the setup.
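The contrast between single-vector and sequence-level scoring can be illustrated with a small sketch. It assumes mean pooling for the single-vector baseline and a ColBERT-style MaxSim (averaged rather than summed, so scores stay comparable across lengths) for late interaction; the paper's actual encoder and scoring function may differ, and all function names and toy vectors here are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def mean_pool(vectors):
    # Collapse a sequence of vectors into one vector (assumed baseline pooling).
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def single_vector_score(doc_a, doc_b):
    # Cosine similarity between the two pooled document vectors.
    return dot(normalize(mean_pool(doc_a)), normalize(mean_pool(doc_b)))

def late_interaction_score(doc_a, doc_b):
    # MaxSim: each vector of doc_a keeps its best cosine match in doc_b;
    # the matches are then averaged over doc_a.
    a = [normalize(v) for v in doc_a]
    b = [normalize(v) for v in doc_b]
    return sum(max(dot(u, v) for v in b) for u in a) / len(a)

# Toy 2-d "token embeddings" (made up for illustration).
doc_a = [[1.0, 0.0], [0.0, 1.0]]
doc_b = [[0.0, 1.0], [1.0, 0.0]]  # same vectors as doc_a, reordered
doc_c = [[1.0, 1.0], [1.0, 1.0]]  # different vectors, same pooled direction

print(single_vector_score(doc_a, doc_b), single_vector_score(doc_a, doc_c))
print(late_interaction_score(doc_a, doc_b), late_interaction_score(doc_a, doc_c))
```

In this toy case pooling gives `doc_b` and `doc_c` the same score against `doc_a`, while the sequence-level score separates them. That is the kind of information a single vector can discard.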

Core claim

Authorship attribution improves when documents are kept as sequences of vectors and compared with late interaction or its patch-level variant, rather than compressed to single vectors, on a contrastive corpus that draws same-author examples from separate papers to limit topical signals.

What carries the argument

Patch-Level Late Interaction (PLI), which compresses neighboring tokens into patches before performing sequence matching between documents.
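A minimal sketch of the patching idea follows. It assumes that a patch is the mean of a fixed number of consecutive, non-overlapping token vectors and that patches are matched with an averaged MaxSim; the summary only says neighboring tokens are compressed into patches, so the pooling operator, the `patch_size` parameter, and all names below are illustrative assumptions rather than the authors' implementation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def to_patches(token_vectors, patch_size):
    # Mean-pool consecutive runs of `patch_size` tokens (assumed operator).
    # The final patch may be shorter if the length is not a multiple.
    if patch_size < 1:
        raise ValueError("patch_size must be at least 1")
    patches = []
    for start in range(0, len(token_vectors), patch_size):
        chunk = token_vectors[start:start + patch_size]
        patches.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return patches

def maxsim(seq_a, seq_b):
    # Averaged MaxSim over cosine similarities.
    a = [normalize(v) for v in seq_a]
    b = [normalize(v) for v in seq_b]
    return sum(max(dot(u, v) for v in b) for u in a) / len(a)

def patch_level_score(tokens_a, tokens_b, patch_size):
    # Compress both sequences into patches first, then match patches.
    return maxsim(to_patches(tokens_a, patch_size),
                  to_patches(tokens_b, patch_size))

# Toy 2-d token embeddings (made up for illustration).
tokens_a = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
tokens_b = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

for size in (1, 2, 4):
    print(size, patch_level_score(tokens_a, tokens_b, size))
```

With `patch_size=1` the score reduces to token-level late interaction, and with a patch size at least as long as the document it reduces to a single pooled vector. The patch size therefore moves the comparison between those two extremes.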

Load-bearing premise

Drawing same-author passages from distinct papers within a field keeps topical overlap low enough that measured gains reflect style rather than content similarity.
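The sampling rule behind this premise can be sketched as follows. The record layout (`id`, `author`, `paper`, `field`), the single-author labels, and both function names are hypothetical; the corpus's real schema and its handling of co-authored papers are not described in the text above.

```python
from collections import defaultdict

def build_positive_pairs(passages, same_paper=False):
    # Pair passages by the same author within the same field.
    # By default only pairs from distinct papers are kept; with
    # same_paper=True only same-paper pairs are kept (a control setting).
    groups = defaultdict(list)
    for p in passages:
        groups[(p["author"], p["field"])].append(p)
    pairs = []
    for group in groups.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if (a["paper"] == b["paper"]) == same_paper:
                    pairs.append((a["id"], b["id"]))
    return pairs

def build_negative_pairs(passages):
    # Pair passages by different authors within the same field.
    by_field = defaultdict(list)
    for p in passages:
        by_field[p["field"]].append(p)
    pairs = []
    for group in by_field.values():
        for i, a in enumerate(group):
            for b in group[i + 1:]:
                if a["author"] != b["author"]:
                    pairs.append((a["id"], b["id"]))
    return pairs

# Made-up records for illustration only.
passages = [
    {"id": 0, "author": "A", "paper": "p1", "field": "cs"},
    {"id": 1, "author": "A", "paper": "p1", "field": "cs"},
    {"id": 2, "author": "A", "paper": "p2", "field": "cs"},
    {"id": 3, "author": "B", "paper": "p3", "field": "cs"},
    {"id": 4, "author": "A", "paper": "p4", "field": "math"},
]

print(build_positive_pairs(passages))
print(build_positive_pairs(passages, same_paper=True))
print(build_negative_pairs(passages))
```

Keeping negatives inside the same field matters as much as separating positives across papers. If the negatives came from other fields, topic alone could tell the two classes apart.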

What would settle it

Retraining and testing the same models on a version of the corpus where same-author passages are drawn from the same paper instead of different papers would show whether the reported gains depend on the contrastive separation of topic and author.

Figures

Figures reproduced from arXiv: 2407.20595 by Florian Cafiero, Francis Kulumba, Guillaume Vimont, Laurent Romary, Wissam Antoun.

Original abstract

Authorship attribution asks whether two pieces of text share a writer, but topical confound makes the task deceptively easy: two authors covering the same topic may look more alike than one author covering two topics. Scholarly prose offers a natural remedy: academic writers produce multiple papers on related but distinct topics while maintaining consistent stylistic habits. We introduce HALvest, a 17-billion-token multilingual corpus of open-access academic papers, and its English contrastive derivative HALvest-Contrastive, where same-author passages are drawn from distinct papers within a disciplinary field to minimize topical overlap. We validate our benchmark by showing that a strong lexical baseline collapses once topical shortcuts are removed. On this same benchmark, we revisit how authorship is scored. Standard systems compress each document into a single vector. We instead keep a sequence of vectors and compare them with late interaction, then propose patch-level late interaction, which groups neighboring tokens into patches before matching. Matching at the sequence level greatly improves performance over the single-vector baseline, but the optimal interaction granularity is subtle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main contribution is a contrastive corpus construction plus patch-level late interaction that beats single-vector baselines on authorship attribution, but the topical overlap concern is not obviously resolved and no performance numbers appear in the abstract.

The punchline is that this work gives a practical way to build harder positive pairs for authorship attribution by sampling same-author text from different papers inside the same field, and then shows that keeping token sequences and doing late interaction (with a patch-compression variant) improves over the usual single embedding per document. That combination looks new relative to the cited single-vector work. The corpus effort itself is substantial, and the multilingual angle is a plus for the task. The abstract's claim that sequence-level matching helps and that granularity is subtle is at least a clear empirical direction.

The stress-test worry about residual sub-topic overlap is reasonable on the surface, but the paper's construction at least tries to control for field-level topic; without the full methods it is hard to say whether the authors checked topical similarity between pairs or just assumed the field split was enough. The absence of any performance numbers, baseline scores, or benchmark sizes in the abstract makes it difficult to judge effect size or whether the evaluation is tight. If the full paper has solid ablations and a check that same-author pairs are not more topically similar than cross-author ones, the central claim holds; otherwise the gains could be partly topical.

This is mainly for people already working on authorship attribution or scholarly text analysis rather than a broad audience. The corpus might get used even if the method does not. It is coherent enough to send for review, though the topical control and quantitative details will need close referee attention.

Referee Report

2 major / 1 minor

Summary. The paper introduces HALvest, a 17-billion-token multilingual corpus of open-access scholarly papers, and HALvest-Contrastive, its English contrastive derivative in which same-author passages are drawn from distinct papers within the same field to reduce topical overlap. It argues that traditional single-vector document representations are insufficient for authorship attribution and instead retains sequences of vectors, compares them via late interaction, and introduces Patch-Level Late Interaction (PLI) that compresses neighboring tokens into patches; the central empirical claim is that sequence-level matching substantially outperforms the single-vector baseline, although the optimal interaction granularity is subtle.

Significance. If the performance gains hold after adequate control for topical signals and the corpus is released, the work supplies both a large-scale resource for authorship and stylometry research and concrete evidence that retrieval-style late interaction can better separate style from topic in scholarly writing. The scale of the corpus and the contrastive construction are clear strengths.

major comments (2)

[HALvest-Contrastive construction] The claim that drawing same-author passages from distinct papers within a field sufficiently minimizes topical overlap is not supported by any explicit verification (e.g., cosine similarity or topic-model overlap between same-author vs. different-author pairs). This is load-bearing because the central claim attributes measured gains to stylistic capture rather than residual topical signals.
[Evaluation] The abstract states that sequence-level matching 'greatly improves performance' over the single-vector baseline, yet no quantitative results, baselines, metrics, or statistical tests are supplied in the provided text, preventing assessment of effect size or robustness.
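The verification requested in the first comment could look roughly like the sketch below. It uses bag-of-words cosine similarity as a crude stand-in for topical similarity; the referee mentions embedding cosine similarity or topic-model overlap, and either could replace `bag_of_words` here. The function names and the toy passages are invented for illustration and are not drawn from the corpus.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    # Lowercased word counts; a deliberately simple topical proxy.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(c1, c2):
    shared = set(c1) & set(c2)
    num = sum(c1[w] * c2[w] for w in shared)
    den = (math.sqrt(sum(v * v for v in c1.values()))
           * math.sqrt(sum(v * v for v in c2.values())))
    return num / den if den else 0.0

def mean_pair_similarity(pairs):
    sims = [cosine(bag_of_words(a), bag_of_words(b)) for a, b in pairs]
    return sum(sims) / len(sims)

def topical_gap(same_author_pairs, diff_author_pairs):
    # A clearly positive gap would suggest same-author pairs are
    # lexically closer, i.e. residual topical overlap in the positives.
    return (mean_pair_similarity(same_author_pairs)
            - mean_pair_similarity(diff_author_pairs))

# Made-up passage pairs for illustration only.
same_author_pairs = [
    ("graph neural networks for molecules", "stylometry of medieval charters"),
]
diff_author_pairs = [
    ("graph neural networks for molecules", "graph neural networks for proteins"),
]

print(topical_gap(same_author_pairs, diff_author_pairs))
```

On real data the gap would be computed over many sampled pairs and reported with a confidence interval; a gap near zero or below it would support the construction, under the assumption that the chosen proxy tracks topic rather than style.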

minor comments (1)

[Abstract] 'PLI' is introduced without spelling out 'Patch-Level Late Interaction' on first use.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments identify key areas where additional evidence and detail will strengthen the manuscript. We address each point below and will revise accordingly.

Referee [HALvest-Contrastive construction]: The claim that drawing same-author passages from distinct papers within a field sufficiently minimizes topical overlap is not supported by any explicit verification (e.g., cosine similarity or topic-model overlap between same-author vs. different-author pairs). This is load-bearing because the central claim attributes measured gains to stylistic capture rather than residual topical signals.

Authors: We agree that explicit verification is needed to support this load-bearing claim. The current manuscript does not include such analysis. In the revision we will add quantitative checks, including average cosine similarities of embeddings and topic-model overlap statistics, comparing same-author versus different-author pairs within HALvest-Contrastive.
Revision: yes

Referee [Evaluation]: The abstract states that sequence-level matching 'greatly improves performance' over the single-vector baseline, yet no quantitative results, baselines, metrics, or statistical tests are supplied in the provided text, preventing assessment of effect size or robustness.

Authors: We acknowledge the omission. The provided text does not contain the detailed quantitative results. We will expand the Evaluation section in the revised manuscript to report the specific performance numbers, baselines, metrics, effect sizes, and statistical tests that support the abstract claim.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical corpus construction and evaluation are self-contained.

The paper constructs HALvest-Contrastive by drawing same-author passages from distinct papers within fields and evaluates sequence-level late interaction (including PLI) against single-vector baselines on this corpus. No equations, fitted parameters, or self-citations reduce the reported performance gains to quantities defined by the inputs themselves. The central claims rest on measured improvements on held-out data, which are externally falsifiable and independent of any internal redefinition or prediction-by-construction. This is the standard case of an empirical study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Report based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The central assumption that the contrastive sampling removes topical confound is treated as a domain assumption rather than a derived result.

axioms (1)

Domain assumption: Same-author passages drawn from distinct papers within a field minimize topical overlap enough for attribution gains to reflect style.
Stated in the abstract as the motivation for HALvest-Contrastive.

pith-pipeline@v0.9.0 · 5677 in / 1110 out tokens · 16582 ms · 2026-05-23T23:09:01.327039+00:00 · methodology
