Where Does Authorship Signal Emerge in Encoder-Based Language Models?

Florian Cafiero; Francis Kulumba; Guillaume Vimont; Laurent Romary

REVIEW · 2 major objections · 2 minor · 3 references

The scoring mechanism alone determines at which layer an encoder consolidates authorship signal during fine-tuning.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-06-30 18:10 UTC pith:YU2U2MWX

load-bearing objection: The scorer choice appears to set the layer of authorship-signal consolidation through gradient differences, backed by interventions and training curves, though post-training interventions leave room for training confounds.

arxiv 2605.19908 v2 pith:YU2U2MWX submitted 2026-05-19 cs.CL

Francis Kulumba, Guillaume Vimont, Laurent Romary, Florian Cafiero

keywords: authorship attribution, encoder models, scoring mechanisms, layer consolidation, gradient structure, training dynamics, mechanistic interpretability, causal intervention

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Authorship attribution models that share the same pretrained encoder, training data, and loss function can still differ by a factor of four in performance solely because of their final scoring step. Stylistic markers such as word length and function-word frequencies remain linearly readable at every layer in every probed encoder, including an untouched control model. Causal interventions indicate that mean pooling forces the encoder to assemble the authorship signal by the early or middle layers, whereas late-interaction scoring postpones that assembly until later layers. The paper derives these timing differences from the distinct gradient structures of the two scorers, and they show up as visibly different learning trajectories over training steps.

Core claim

When the same encoder is fine-tuned for authorship attribution, the choice of scorer (mean pooling versus late interaction) controls the layer depth at which authorship signal is consolidated inside the encoder. Mean pooling induces early consolidation; late interaction defers it. The difference follows from the gradient flow each scorer imposes and is visible in the distinct training curves that result.

What carries the argument

The scorer (mean pooling versus late interaction), whose gradient structure dictates the layer at which the encoder must assemble authorship information.
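The two scorers named above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes cosine similarity, a ColBERT-style MaxSim for late interaction, and averaging (rather than summing) the per-token maxima. The function names and the random stand-in embeddings are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mean_pool_score(query_tokens, doc_tokens):
    """Average each text's token vectors, then compare the two pooled vectors."""
    q = l2_normalize(query_tokens.mean(axis=0))
    d = l2_normalize(doc_tokens.mean(axis=0))
    return float(q @ d)

def late_interaction_score(query_tokens, doc_tokens):
    """MaxSim: each query token keeps only its best-matching document token."""
    sim = l2_normalize(query_tokens) @ l2_normalize(doc_tokens).T  # (n_q, n_d)
    return float(sim.max(axis=1).mean())

# Random stand-ins for encoder outputs: 5 and 7 tokens, 16 dimensions each.
rng = np.random.default_rng(0)
query = rng.normal(size=(5, 16))
doc = rng.normal(size=(7, 16))
print(mean_pool_score(query, doc), late_interaction_score(query, doc))
```

The design difference is where token-level detail is discarded: mean pooling collapses each text to one vector before any comparison, while late interaction compares tokens first and aggregates afterwards.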

Load-bearing premise

Stylistic features remain comparably linearly readable at every layer across all models, including the off-the-shelf control, so the performance gap is produced by where consolidation occurs rather than by differences in feature availability.
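"Linearly readable" can be made concrete with a linear probe scored by held-out R². The sketch below assumes a closed-form ridge probe and uses synthetic hidden states; the paper's probe (labelled LISA in its Figure 3) may differ, and `probe_r2`, `word_length`, and the layer arrays are hypothetical stand-ins.

```python
import numpy as np

def probe_r2(hidden, target, ridge=1e-2, train_frac=0.8, seed=0):
    """Fit a ridge-regularised linear probe and return its held-out R^2."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(target))
    cut = int(train_frac * len(target))
    train, test = order[:cut], order[cut:]
    # Append a constant column so the probe can learn an intercept.
    X = np.hstack([hidden, np.ones((len(hidden), 1))])
    A = X[train].T @ X[train] + ridge * np.eye(X.shape[1])
    w = np.linalg.solve(A, X[train].T @ target[train])
    residual = target[test] - X[test] @ w
    total = target[test] - target[test].mean()
    return 1.0 - (residual @ residual) / (total @ total)

# Synthetic stand-in: three "layers" of 32-d states for 500 texts, each
# carrying the feature along one direction plus unit-variance noise.
rng = np.random.default_rng(1)
word_length = rng.normal(size=500)      # hypothetical stylistic feature
direction = rng.normal(size=32)
layers = [np.outer(word_length, direction) + rng.normal(size=(500, 32))
          for _ in range(3)]
noise_only = rng.normal(size=(500, 32))  # control with no feature information

r2_by_layer = [probe_r2(h, word_length) for h in layers]
r2_control = probe_r2(noise_only, word_length)
print(r2_by_layer, r2_control)
```

Under the premise above, a per-layer R² profile of this kind would look similar across the fine-tuned models and the control, which is why readability alone would not explain the performance gap.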

What would settle it

Train the same encoder with both scorers on a dataset where the linear readability of stylistic features increases sharply only in the final layers; if the four-fold performance gap then disappears or reverses, the consolidation-timing account is falsified.

If this is right

Mean-pooling models assemble authorship information by early-to-middle layers.
Late-interaction models keep authorship information distributed until deeper layers.
Training dynamics diverge because each scorer back-propagates a different gradient pattern.
Performance gaps arise from the timing of consolidation, not from the linear readability of stylistic features.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers could deliberately pair a scorer with a target layer depth to control when a model learns a given signal.
The same scorer-induced timing effect may appear when probing other document-level properties such as genre or sentiment.
Interventions that freeze or ablate layers at the consolidation point could serve as a general diagnostic for where any encoder-based classifier builds its decision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

The central result is that mean-pooling heads push authorship signal consolidation earlier in the encoder while late-interaction heads push it later, and the authors trace this to the gradient structure of each scorer plus distinct training trajectories.

They do a few things cleanly. Stylistic features stay linearly readable across layers even in the off-the-shelf encoder, which removes the simple explanation that one scorer just makes features easier to read. The causal interventions then test the consolidation claim directly, and the gradient derivation gives a mechanistic reason why the two scorers should produce different layer-wise behavior. Training dynamics add a temporal check that matches the gradient prediction.

The soft spot is the one the stress-test flags. All interventions happen on already-trained encoders, so it is hard to separate the scorer's direct effect on consolidation from the fact that each scorer shaped the encoder weights during joint fine-tuning. The off-the-shelf control shows features are available but does not test whether the two fine-tuned encoders already differ in attention patterns or representation geometry before any intervention. If those differences are scorer-induced during training, the interventions may be measuring the outcome rather than isolating the mechanism.

This is useful for people working on layer-wise interpretability or on practical authorship systems where head choice matters. The methods are mechanistic enough and the claim specific enough that it should go to referees rather than desk rejection; the experimental details will decide how far the causal story holds.

Referee Report

2 major / 2 minor

Summary. The paper claims that authorship attribution models using identical pretrained encoders, data, and loss exhibit up to four-fold performance differences solely due to the scoring mechanism (mean pooling vs. late interaction). Stylistic features (word length, punctuation density, function-word frequency) are linearly readable at every layer across all models including an off-the-shelf control, ruling out readability as the cause. Causal interventions demonstrate that mean pooling forces authorship signal consolidation in early-to-mid layers while late interaction defers it to later layers; this timing difference is derived from the gradient structure of each scorer, and training dynamics exhibit distinct trajectories consistent with that structure.

Significance. If the causal isolation holds, the work supplies a mechanistic account of how scorer choice shapes representation learning in encoder-based models for authorship attribution, with potential implications for other sequence classification tasks. Strengths include the combination of causal interventions, explicit gradient derivations, and training-trajectory comparisons, which together move beyond correlational probes.

major comments (2)

[Causal intervention experiments] Causal intervention section: the claim that interventions cleanly isolate scorer mechanics from training-induced encoder adaptations is load-bearing for the central result, yet the off-the-shelf control only shows feature availability across layers and does not compare internal representations or attention patterns produced by mean-pooling versus late-interaction fine-tuning trajectories before any intervention is applied.
[Gradient structure derivation] Gradient-structure derivation: the derivation must demonstrate that the predicted consolidation timing difference arises from scorer gradients independently of any scorer-specific reorganization of encoder weights during joint training; otherwise the post-hoc interventions on fully trained encoders cannot distinguish the two.

minor comments (2)

Notation for the two scorers should be introduced once with explicit equations rather than relying on prose descriptions in multiple sections.
Figure captions for layer-wise probe accuracies should state the exact number of runs and report standard deviation bands.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the scope of our claims. We provide point-by-point responses to the major comments below.

Referee: [Causal intervention experiments] Causal intervention section: the claim that interventions cleanly isolate scorer mechanics from training-induced encoder adaptations is load-bearing for the central result, yet the off-the-shelf control only shows feature availability across layers and does not compare internal representations or attention patterns produced by mean-pooling versus late-interaction fine-tuning trajectories before any intervention is applied.

Authors: The referee correctly identifies that the off-the-shelf control establishes linear readability of stylistic features but does not directly compare internal representations or attention patterns between the two fine-tuning trajectories prior to intervention. Our causal interventions are applied post-training to each scorer-specific model, and the resulting consolidation differences, together with the distinct training trajectories, support scorer-driven effects. To strengthen the isolation claim, we will add a comparison of representation similarities and attention patterns from the mean-pooling and late-interaction models before interventions.
revision: yes

Referee: [Gradient structure derivation] Gradient-structure derivation: the derivation must demonstrate that the predicted consolidation timing difference arises from scorer gradients independently of any scorer-specific reorganization of encoder weights during joint training; otherwise the post-hoc interventions on fully trained encoders cannot distinguish the two.

Authors: The derivation computes gradients through each scorer's functional form (mean pooling versus late interaction) back to the encoder outputs, showing how signal propagation timing differs solely due to the scorer structure. This holds independently of downstream encoder reorganization because it follows from the mathematical form of the scorer gradients. The observed training dynamics match these predictions, and the interventions on trained models confirm the outcome. We will add a clarifying sentence on this independence in the revision.
revision: partial
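The kind of gradient difference the rebuttal appeals to can be illustrated at the scorer's inputs. The sketch below is not the paper's derivation: it assumes unnormalised dot-product scores, a MaxSim late-interaction scorer averaged over query tokens, and hypothetical function names. It only shows the gradient arriving at the encoder outputs, not how that pattern propagates into earlier layers. With mean pooling, every document token receives the same gradient, mean(q) / n_d. With late interaction, only the document tokens selected by some query token's max receive any gradient.

```python
import numpy as np

def mean_pool_score(q, d):
    """Dot product of the two mean-pooled token matrices."""
    return float(q.mean(axis=0) @ d.mean(axis=0))

def late_interaction_score(q, d):
    """Mean over query tokens of the best dot product with any document token."""
    return float((q @ d.T).max(axis=1).mean())

def mean_pool_grad(q, d):
    """Analytic gradient w.r.t. d: identical for every document token."""
    return np.tile(q.mean(axis=0) / len(d), (len(d), 1))

def late_interaction_grad(q, d):
    """Analytic gradient w.r.t. d: non-zero only for max-selected tokens."""
    best = (q @ d.T).argmax(axis=1)      # winning document token per query token
    grad = np.zeros_like(d)
    np.add.at(grad, best, q / len(q))    # accumulate, since winners can repeat
    return grad

def numeric_grad(fn, q, d, eps=1e-6):
    """Central finite differences with respect to each entry of d."""
    grad = np.zeros_like(d)
    for idx in np.ndindex(*d.shape):
        step = np.zeros_like(d)
        step[idx] = eps
        grad[idx] = (fn(q, d + step) - fn(q, d - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
q, d = rng.normal(size=(3, 4)), rng.normal(size=(10, 4))
g_mean, g_late = mean_pool_grad(q, d), late_interaction_grad(q, d)
tokens_with_gradient = int((np.abs(g_late).sum(axis=1) > 0).sum())
print(tokens_with_gradient, "of", len(d), "document tokens receive gradient")
```

Whether this input-level asymmetry is enough to explain layer-wise consolidation timing independently of joint training is exactly what the referee's second major comment asks the authors to demonstrate.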

Circularity Check

0 steps flagged

No significant circularity; derivation relies on independent gradient analysis and interventions

The paper claims the scorer determines consolidation timing, derived from gradient structure of mean-pooling vs. late-interaction scorers plus causal interventions and training dynamics. These steps use mathematical properties of gradients and experimental probes (including off-the-shelf control showing stylistic features readable across layers) rather than any reduction to fitted parameters by construction, self-definitional loops, or load-bearing self-citations. No equations or claims in the abstract or described chain equate the target result to its inputs; the central claim retains independent content from the data and interventions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard mechanistic interpretability assumptions and off-the-shelf encoders without introducing new free parameters, invented entities, or ad-hoc axioms beyond domain-standard ones.

axioms (1)

domain assumption: Causal interventions on model activations isolate the effect of the scorer on representation formation.
Invoked to conclude that the scorer determines consolidation timing.
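For readers unfamiliar with the intervention this axiom refers to, the following is a minimal sketch of activation patching on a toy residual encoder. Everything here is an assumption for illustration: the toy architecture, the choice to patch half of the token positions, and the names `run_encoder` and `score`. The paper's actual patching protocol is not specified in this review.

```python
import numpy as np

def run_encoder(tokens, weights, patch=None):
    """Run a toy residual encoder, optionally overwriting some activations.

    patch = (layer_index, positions, clean_cache) replaces the hidden states
    entering `layer_index` at `positions` with cached clean states.
    Returns the final hidden states and the states entering each layer.
    """
    h = tokens.copy()
    cache = []
    for layer, w in enumerate(weights):
        if patch is not None and patch[0] == layer:
            _, positions, clean_cache = patch
            h[positions] = clean_cache[layer][positions]
        cache.append(h.copy())
        context = h.mean(axis=0, keepdims=True)  # crude stand-in for token mixing
        h = h + np.tanh((h + context) @ w)
    return h, cache

def score(final_hidden, readout):
    """Mean-pool the final states and project onto a readout direction."""
    return float(final_hidden.mean(axis=0) @ readout)

rng = np.random.default_rng(0)
n_layers, n_tokens, dim = 6, 8, 16
weights = [rng.normal(scale=0.3, size=(dim, dim)) for _ in range(n_layers)]
readout = rng.normal(size=dim)
clean = rng.normal(size=(n_tokens, dim))
corrupt = rng.normal(size=(n_tokens, dim))

clean_final, clean_cache = run_encoder(clean, weights)
corrupt_final, _ = run_encoder(corrupt, weights)
s_clean, s_corrupt = score(clean_final, readout), score(corrupt_final, readout)

# Restore clean activations for half of the tokens at each layer in turn and
# record how far the score moves away from the corrupted baseline.
positions = np.arange(n_tokens // 2)
sensitivity = []
for layer in range(n_layers):
    patched_final, _ = run_encoder(corrupt, weights,
                                   patch=(layer, positions, clean_cache))
    sensitivity.append(abs(score(patched_final, readout) - s_corrupt))
print(sensitivity)
```

Note that the intervention runs on frozen, already-trained weights. That is the reason the review treats isolation of the scorer's effect as an assumption rather than a demonstrated result.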

pith-pipeline@v0.9.1-grok · 5659 in / 1286 out tokens · 32332 ms · 2026-06-30T18:10:47.247148+00:00 · methodology

Original abstract

Authorship attribution models fine-tuned with the same pretrained encoder, data, and loss can differ four-fold in performance depending only on their scoring mechanism. We use mechanistic interpretability tools to explain this gap. Stylistic features such as word length, punctuation density, and function-word frequency are similarly available at every layer in every model we probe, including an off-the-shelf control encoder, suggesting that the gap is not explained by their linear readability. Instead, causal intervention shows that the scorer appears to determine where the encoder consolidates authorship signal. Mean pooling forces consolidation by early to mid layers, while late interaction defers it to later layers. We further derive this difference from the gradient structure of each scorer, and training dynamics reveal distinct learning trajectories that follow from that difference.

Figures

Figures reproduced from arXiv: 2605.19908 by Florian Cafiero, Francis Kulumba, Guillaume Vimont, Laurent Romary.

**Figure 1.** Conceptual overview. Left: The pretrained language model encodes stylistic features at every layer, regardless of fine-tuning. Center: Two scoring mechanisms read out these features differently. Mean pooling averages all tokens into a single vector. Late interaction (LI) (Khattab and Zaharia, 2020) compares tokens directly. Right: Causal intervention reveals that the scoring mechanism determines where the …

**Figure 2.** Token length distributions for positive (blue) …

**Figure 3.** LISA probe $R^2$ heatmaps at the final checkpoint. Rows are stylistic feature categories. Columns are encoder layers. The three fine-tuned models produce nearly identical heatmaps. Word length is the most readable feature ($R^2 \approx 0.57$), followed by capitalization rate, type–token ratio, and punctuation density. [Adjacent plot residue: x-axis "Patch layer index" (0 to 20), y-axis "Fraction rank-recovered" (0.0 to 1.0), title "Rank recovery all mode…"]

**Figure 4.** Rank recovery across the three models. Each panel shows one tier. Purple: layerwise (mean pooling), orange: LI, green: PLI n=2. Dashed line: chance (0.5). Mean pooling crosses chance at layer 9, while both interaction models cross at layers 14–16. The six-layer gap is consistent across all three tiers.

**Figure 5.** Score sensitivity per layer. Mean $|s^{(\ell)}_{\text{patched}} - s_{\text{corrupt}}|$ when restoring clean activations at layer $\ell$. LI (orange) is most sensitive, PLI (green) is intermediate, layerwise (purple) is an order of magnitude lower.

**Figure 6.** Training dynamics. Mean percentage recovery across Tier A triplets at eight checkpoints. Each subplot is one checkpoint. x-axis: layer index; y-axis: mean recovery. Percentage recovery is used here because rank recovery is binary and too coarse to track gradual signal emergence at early checkpoints. The y-axis extremes reflect the known instability of percentage recovery (§2.5).
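The captions contrast two recovery metrics without defining them. The sketch below shows one common way such metrics are computed in activation-patching work, and why a ratio-based metric can be unstable. Both definitions are assumptions: the paper's exact formulas (its §2.5) are not reproduced on this page, and the function names and numbers are hypothetical.

```python
import numpy as np

def percentage_recovery(s_clean, s_corrupt, s_patched):
    """Share of the clean-vs-corrupt score gap restored by patching."""
    return (s_patched - s_corrupt) / (s_clean - s_corrupt)

def rank_recovered(score_positive, score_negative):
    """Binary outcome: 1.0 if the positive outranks the negative after patching."""
    return float(score_positive > score_negative)

# A wide clean/corrupt gap gives a well-behaved ratio.
stable = percentage_recovery(s_clean=0.90, s_corrupt=0.10, s_patched=0.50)

# A near-zero gap lets a small absolute shift produce an extreme value.
unstable = percentage_recovery(s_clean=0.101, s_corrupt=0.100, s_patched=0.150)

# Averaging the binary outcome over triplets gives a fraction in [0, 1].
pairs = [(0.8, 0.3), (0.4, 0.6), (0.7, 0.2), (0.55, 0.5)]
fraction_rank_recovered = np.mean([rank_recovered(p, n) for p, n in pairs])
print(stable, unstable, fraction_rank_recovered)
```

Under these assumed definitions the trade-off in the Figure 6 caption is visible: the binary metric is bounded but coarse, while the ratio tracks gradual change but blows up when the clean and corrupted scores are close.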
