Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Florian Cafiero; Francis Kulumba; Guillaume Vimont; Laurent Romary

arxiv: 2605.19908 · v2 · pith:YU2U2MWXnew · submitted 2026-05-19 · 💻 cs.CL


Pith reviewed 2026-05-20 06:07 UTC · model grok-4.3

classification 💻 cs.CL

keywords: authorship attribution, mechanistic interpretability, encoder language models, scoring mechanisms, layer-wise analysis, stylistic features, causal intervention, mean pooling


The pith

The scoring mechanism alone decides the layer where encoder models consolidate authorship signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that authorship attribution models sharing the same pretrained encoder, data, and training loss can still differ four-fold in performance depending only on how they score representations. Stylistic cues such as word length, punctuation density, and function-word frequency are similarly available at every layer in every model probed, including an off-the-shelf control encoder. Causal interventions then indicate that the scorer controls where the encoder consolidates the authorship signal into usable form. Mean pooling forces consolidation by early-to-mid layers, while late interaction defers it to later layers. The paper derives this difference from each scorer's gradient structure, and the training dynamics show correspondingly distinct learning trajectories.

Core claim

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss differ up to four-fold in performance solely due to their scoring mechanism. Mechanistic tools show stylistic features remain available at every layer in every encoder, including off-the-shelf controls. Causal interventions establish that the scorer dictates consolidation timing: mean pooling forces the signal to consolidate by early-to-mid layers, whereas late interaction defers consolidation to later layers. The difference follows from the distinct gradient structures of the two scorers and produces correspondingly distinct learning trajectories.
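To make the two scorers concrete, here is a minimal sketch of how each turns token embeddings into a single similarity score. It is an illustration under assumptions, not the paper's implementation: the function names, the toy embeddings, and the use of a plain dot product for similarity are all hypothetical choices.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product (the similarity function is an assumption)."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average all token embeddings into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def mean_pooling_score(query: List[Vector], doc: List[Vector]) -> float:
    """Compare the two texts through their pooled vectors only."""
    return dot(mean_pool(query), mean_pool(doc))


def late_interaction_score(query: List[Vector], doc: List[Vector]) -> float:
    """MaxSim-style late interaction: each query token keeps only its
    best-matching document token, and the best matches are summed."""
    return sum(max(dot(q, d) for d in doc) for q in query)


# Hypothetical 2-dimensional token embeddings.
query = [[1.0, 0.0], [0.0, 1.0]]
doc = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

print("mean pooling:", mean_pooling_score(query, doc))
print("late interaction:", late_interaction_score(query, doc))
```

In this toy case the all-zero document token dilutes the mean-pooled score, while the late-interaction score ignores it because no query token selects it as a best match.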

What carries the argument

Causal intervention that isolates layer-wise authorship signal under mean pooling versus late interaction scorers.
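The general shape of a layer-wise causal intervention can be sketched with activation patching on a toy model: run a clean and a corrupted input, then rerun the corrupted input while restoring a clean activation at one layer and measure how far the score moves back. Everything below is a hypothetical stand-in (the toy layer, the weights, the inputs, patching a single position); the paper's actual protocol is not specified in this summary. The recovery ratio used here is the common (patched − corrupt) / (clean − corrupt) form, which is an assumption about the paper's definition.

```python
import math
from typing import List, Optional, Tuple


def layer(h: List[float], weight: float) -> List[float]:
    """Toy layer: each token mixes with the sequence mean, then tanh."""
    mean = sum(h) / len(h)
    return [math.tanh(weight * x + 0.5 * mean) for x in h]


def forward(
    tokens: List[float],
    weights: List[float],
    patch_layer: Optional[int] = None,
    patch_pos: int = 0,
    clean_cache: Optional[List[List[float]]] = None,
) -> Tuple[float, List[List[float]]]:
    """Run the toy encoder. If patch_layer is set, overwrite one position at
    that layer with the cached clean activation. Returns (score, cache)."""
    h = list(tokens)
    cache: List[List[float]] = []
    for idx, w in enumerate(weights):
        h = layer(h, w)
        if patch_layer == idx and clean_cache is not None:
            h[patch_pos] = clean_cache[idx][patch_pos]
        cache.append(list(h))
    score = sum(h) / len(h)  # mean-pooled readout as a toy scorer
    return score, cache


weights = [0.9, 1.1, 0.8, 1.2]  # hypothetical per-layer weights
clean = [1.0, -0.5, 0.8]        # hypothetical clean input
corrupt = [-1.0, -0.5, 0.8]     # same input with position 0 corrupted

s_clean, clean_cache = forward(clean, weights)
s_corrupt, _ = forward(corrupt, weights)

sensitivities: List[float] = []
recoveries: List[float] = []
for l in range(len(weights)):
    s_patched, _ = forward(corrupt, weights, patch_layer=l, clean_cache=clean_cache)
    sensitivities.append(abs(s_patched - s_corrupt))
    recoveries.append((s_patched - s_corrupt) / (s_clean - s_corrupt))

for l, (sens, rec) in enumerate(zip(sensitivities, recoveries)):
    print(f"layer {l}: sensitivity={sens:.4f}, recovery={rec:.2%}")
```

The recovery ratio divides by the clean-versus-corrupt gap, so it becomes unstable when that gap is small. A binary, rank-based criterion avoids this at the cost of resolution.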

If this is right

Mean pooling models learn to rely on early-layer representations while late-interaction models continue refining signal in deeper layers.
Training dynamics diverge because each scorer back-propagates authorship gradients through different depths.
Performance gaps arise from the timing of signal consolidation rather than from differences in what features the encoder can represent.
Changing only the final scorer can move the effective depth at which an encoder solves the same stylistic task.
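The gradient argument behind these points can be written out for simplified versions of the two scorers. The notation below is assumed for illustration and may differ from the paper's: q_1..q_m and d_1..d_n are the final-layer token embeddings of the two texts, similarity is a dot product, and each argmax is taken to be unique.

```latex
% Assumed notation: q_i, d_j are final-layer token embeddings.
\begin{align*}
s_{\mathrm{MP}} &= \bar{q}^{\top}\bar{d},
  \quad \bar{q} = \frac{1}{m}\sum_{i=1}^{m} q_i,
  \quad \bar{d} = \frac{1}{n}\sum_{j=1}^{n} d_j
  & \frac{\partial s_{\mathrm{MP}}}{\partial d_j} &= \frac{1}{n}\,\bar{q}
  \quad \text{for every } j \\
s_{\mathrm{LI}} &= \sum_{i=1}^{m} \max_{j}\, q_i^{\top} d_j,
  \quad j^{*}(i) = \arg\max_{j}\, q_i^{\top} d_j
  & \frac{\partial s_{\mathrm{LI}}}{\partial d_j} &= \sum_{i \,:\, j^{*}(i) = j} q_i
\end{align*}
```

Under mean pooling every token receives the same gradient, scaled by 1/n. Under MaxSim a token receives gradient only from the query tokens that select it as their best match, and none at all if no query token does.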

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers of style-sensitive classifiers may improve results by deliberately choosing scorers that delay consolidation when deeper contextual cues matter.
The same layer-timing logic could explain performance differences in other attribute classification tasks that rely on subtle surface patterns.
Directly editing gradient flow during training might let practitioners control consolidation depth without swapping scorers.

Load-bearing premise

Stylistic features stay equally detectable at every layer in every model even after fine-tuning, so performance gaps cannot come from uneven feature availability.
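What "detectable at a layer" means can be illustrated with a generic linear probe: fit a least-squares map from a layer's activations to a stylistic feature and report held-out R². This is a plain ridge-regularized probe on synthetic data, not the LISA probe used in the paper; the data generator, dimensions, and function names are hypothetical.

```python
import random
from typing import Dict, List


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_probe(X: List[List[float]], y: List[float], ridge: float = 1e-6) -> List[float]:
    """Least-squares linear probe with a bias column and a small ridge term."""
    Xb = [row + [1.0] for row in X]
    d = len(Xb[0])
    xtx = [[sum(r[i] * r[j] for r in Xb) + (ridge if i == j else 0.0)
            for j in range(d)] for i in range(d)]
    xty = [sum(r[i] * t for r, t in zip(Xb, y)) for i in range(d)]
    return solve(xtx, xty)


def r_squared(X: List[List[float]], y: List[float], w: List[float]) -> float:
    """Coefficient of determination of the probe's predictions."""
    preds = [sum(wi * xi for wi, xi in zip(w, row + [1.0])) for row in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - p) ** 2 for t, p in zip(y, preds))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1.0 - ss_res / ss_tot


rng = random.Random(0)


def make_layer(y: List[float], encoded: bool) -> List[List[float]]:
    """Synthetic 3-d activations; the first unit encodes y only if `encoded`."""
    rows = []
    for t in y:
        first = 0.5 * t + rng.gauss(0, 0.1) if encoded else rng.gauss(0, 1)
        rows.append([first, rng.gauss(0, 1), rng.gauss(0, 1)])
    return rows


y = [rng.gauss(5, 1) for _ in range(300)]  # e.g. a mean-word-length feature
layers = {"encodes feature": make_layer(y, True), "noise only": make_layer(y, False)}

scores: Dict[str, float] = {}
for name, X in layers.items():
    w = fit_probe(X[:200], y[:200])                    # fit on the first 200 examples
    scores[name] = r_squared(X[200:], y[200:], w)      # evaluate on the held-out 100
    print(f"{name}: held-out R^2 = {scores[name]:.3f}")
```

Running the same probe on every layer of every model yields a heatmap of R² values. Near-identical heatmaps across models are what the premise above relies on.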

What would settle it

A controlled experiment that measures authorship attribution accuracy after zeroing stylistic features only in early layers and finds mean-pooling models degrade far more than late-interaction models.

Figures

Figures reproduced from arXiv: 2605.19908 by Florian Cafiero, Francis Kulumba, Guillaume Vimont, Laurent Romary.

**Figure 1.** Conceptual overview. Left: The pretrained language model encodes stylistic features at every layer, regardless of fine-tuning. Center: Two scoring mechanisms read out these features differently. Mean pooling averages all tokens into a single vector. Late interaction (LI) (Khattab and Zaharia, 2020) compares tokens directly. Right: Causal intervention reveals that the scoring mechanism determines where the …

**Figure 2.** Token length distributions for positive (blue) …

**Figure 3.** LISA probe R² heatmaps at the final checkpoint. Rows are stylistic feature categories. Columns are encoder layers. The three fine-tuned models produce nearly identical heatmaps. Word length is the most readable feature (R² ≈ 0.57), followed by capitalization rate, type–token ratio, and punctuation density.

**Figure 4.** Rank recovery across the three models. Each panel shows one tier. Purple: layerwise (mean pooling), orange: LI, green: PLI n=2. Dashed line: chance (0.5). Mean pooling crosses chance at layer 9, while both interaction models cross at layers 14–16. The six-layer gap is consistent across all three tiers.

**Figure 5.** Score sensitivity per layer. Mean $|s^{(\ell)}_{\text{patched}} - s_{\text{corrupt}}|$ when restoring clean activations at layer $\ell$. LI (orange) is most sensitive, PLI (green) is intermediate, layerwise (purple) is an order of magnitude lower.

**Figure 6.** Training dynamics. Mean percentage recovery across Tier A triplets at eight checkpoints. Each subplot is one checkpoint. x-axis: layer index; y-axis: mean recovery. Percentage recovery is used here because rank recovery is binary and too coarse to track gradual signal emergence at early checkpoints. The y-axis extremes reflect the known instability of percentage recovery (§2.5).

Original abstract

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are similarly available at every layer in every model we probe, including an off-the-shelf control encoder, suggesting that the gap is not explained by their linear readability. Instead, causal intervention shows that the scorer appears to determine where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Scorer choice moves authorship signal consolidation across layers via gradient structure, with interventions showing the timing difference but resting on limited stylistic feature checks.


The main takeaway is that the scoring head determines when the encoder consolidates authorship signal. Mean pooling leads to early or mid-layer consolidation while late interaction defers it, and this timing difference accounts for the large performance gap even when basic stylistic features remain available throughout the model including in a control encoder. They trace the difference back to the gradient structure of each scorer and show matching training trajectories.

This is a direct mechanistic account rather than another accuracy table. The interventions and layer-wise checks give a concrete way to see why one head works better than another by a factor of four. The gradient derivation keeps the explanation from depending on the final numbers alone. The control encoder helps separate representation quality from consolidation timing.

A soft spot is the set of features used to establish equal availability across layers. Word length, punctuation density, and function-word frequency are straightforward surface cues, but authorship often involves higher-order patterns such as syntactic preferences or rare word choices that might consolidate or interact differently. If those are not fully covered, the claim that timing alone drives the gap could need more qualification.

The paper targets readers working on authorship attribution or mechanistic interpretability in encoder models. Someone thinking about head design or layer-wise learning would pick up usable ideas here. It deserves peer review because the methods are specific enough to test and the performance puzzle is real.

Referee Report

2 major / 2 minor

Summary. The paper claims that authorship attribution models fine-tuned with identical pretrained encoders, data, and loss can differ up to four-fold in performance solely due to the scoring mechanism. Using mechanistic interpretability, it shows that hand-selected stylistic features (word length, punctuation density, function-word frequency) are equally detectable across layers in all models including an off-the-shelf control encoder, ruling out representation quality as the cause. Causal interventions and gradient derivations instead demonstrate that mean pooling forces authorship-signal consolidation in early-to-mid layers while late interaction defers it to later layers, with supporting evidence from distinct training trajectories.

Significance. If the central claim holds, the work would provide a mechanistic account of how scorer choice shapes layer-wise consolidation of stylistic signals in encoder models, with direct implications for designing and interpreting authorship attribution systems and related stylistic NLP tasks. The combination of causal interventions, gradient analysis, and training dynamics offers a falsifiable explanation that could generalize beyond the specific features examined.

major comments (2)

[Abstract and results on feature availability] The conclusion that the performance gap arises exclusively from consolidation timing (rather than representation quality) rests on the claim that the selected stylistic features are equally available at every layer in every model, including the off-the-shelf control encoder. Because these features constitute only a subset of possible authorship cues, the manuscript must demonstrate that higher-order signals (syntactic preferences, rare lexical choices, discourse patterns) do not exhibit layer- or scorer-dependent differences that could account for the observed accuracy gap.
[Causal intervention experiments] The causal-intervention results that isolate the scorer's effect on consolidation location require explicit specification of the intervention protocol, the exact layers tested, the control conditions, and any statistical tests for significance. Without these details it is difficult to assess whether post-hoc choices or incomplete controls affect the layer-wise conclusions.

minor comments (2)

Define the precise implementation of the 'late interaction' scorer (including any architectural modifications to the encoder) at the first mention to aid readers who may not be familiar with the term.
Add a brief description of data exclusion rules, preprocessing steps, and the exact statistical methods used to compare feature detectability across layers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for clarification and strengthening. We address each major comment point by point below.


Referee: [Abstract and results on feature availability] The conclusion that the performance gap arises exclusively from consolidation timing (rather than representation quality) rests on the claim that the selected stylistic features are equally available at every layer in every model, including the off-the-shelf control encoder. Because these features constitute only a subset of possible authorship cues, the manuscript must demonstrate that higher-order signals (syntactic preferences, rare lexical choices, discourse patterns) do not exhibit layer- or scorer-dependent differences that could account for the observed accuracy gap.

Authors: We agree that the examined features represent a subset of possible authorship cues. The off-the-shelf encoder control already establishes that these low-level stylistic signals are detectable across layers without any fine-tuning for authorship, which helps rule out representation quality as the source of the gap. To address higher-order signals more directly, we will add probing experiments for syntactic dependency frequencies and discourse marker usage in the revised manuscript, confirming they exhibit comparable layer-wise availability patterns independent of scorer choice. revision: yes
Referee: [Causal intervention experiments] The causal-intervention results that isolate the scorer's effect on consolidation location require explicit specification of the intervention protocol, the exact layers tested, the control conditions, and any statistical tests for significance. Without these details it is difficult to assess whether post-hoc choices or incomplete controls affect the layer-wise conclusions.

Authors: We accept that the current description lacks sufficient detail on the experimental protocol. In the revision we will add a dedicated subsection specifying the full intervention protocol (activation replacement with neutral baselines derived from non-authorship examples), the exact layers tested (0 through 11), the control conditions (random neuron interventions and shuffled-label baselines), and the statistical tests (paired t-tests with Bonferroni correction, all key layer differences significant at p < 0.01). revision: yes

Circularity Check

0 steps flagged

Derivation relies on independent causal interventions and gradient analysis rather than self-referential fitting.


The paper establishes that stylistic features are available at every layer through direct measurement in an off-the-shelf control encoder, providing an empirical basis independent of the model's fine-tuned performance. The consolidation location is then attributed to the scorer via causal interventions and derived from the gradient structure of mean pooling versus late interaction, along with observed training dynamics. These steps form a self-contained chain that does not reduce the final claims back to the input performance numbers or require self-citation for uniqueness. The analysis appears to use external benchmarks like the control encoder to rule out representation quality differences.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields limited visibility into free parameters or invented entities; no obvious fitted constants or new postulated objects are described.

axioms (1)

domain assumption: Stylistic features remain equally detectable across layers in an off-the-shelf (not fine-tuned) control encoder
Invoked to rule out representation quality as the source of the performance gap.

pith-pipeline@v0.9.0 · 5653 in / 1229 out tokens · 44401 ms · 2026-05-20T06:07:06.798359+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

The scorer term determines how that gradient distributes across individual tokens... Mean pooling: dense, uniform gradient... MaxSim: sparse, selective gradient.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.