pith. sign in

arxiv: 2410.02064 · v3 · pith:YMRIIM7Xnew · submitted 2024-10-02 · 💻 cs.LG · cs.AI· cs.CL

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

Pith reviewed 2026-05-23 19:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords self-generated text recognitionresidual stream vectorLlama3-8b-Instructself-authorshipactivation steeringpost-training effectsAI safety
0
0 comments X

The pith

A vector in Llama3-8b-Instruct's residual stream controls its recognition of self-generated text and its claims about authorship.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the chat-tuned Llama3-8b-Instruct model, unlike its base version, can reliably tell its own outputs from human text by drawing on patterns learned during post-training. It identifies a direction in the residual stream that activates during correct self-recognition judgments, responds to self-authorship cues, and connects to the model's internal representation of self. Steering experiments then show this vector is causally involved, since adding or subtracting it during generation makes the model assert or deny authorship of its outputs, and applying it while reading makes the model believe or disbelieve it wrote arbitrary text. A reader would care because the finding supplies a concrete internal handle on a capability relevant to questions of model self-knowledge and output attribution.

Core claim

The central claim is that a vector in the residual stream of Llama3-8b-Instruct is differentially activated when the model correctly judges text as self-written, activates in response to information about self-authorship, is related to the concept of self, and is causally tied to the model's ability both to perceive self-authorship when reading and to assert it when generating, as shown by the fact that adding or subtracting the vector during output changes authorship claims and adding or subtracting it to input text changes the model's subsequent judgments about who wrote that text.

What carries the argument

A direction in the residual stream that activates on self-authorship information and can be added or subtracted to steer both reading judgments and generation behavior.

If this is right

  • Steering the vector while the model generates text causes it to claim or disclaim authorship of its own outputs.
  • Steering the vector while the model reads arbitrary text causes it to believe or disbelieve it wrote that text.
  • The same vector supports both the perception of self-authorship during reading and the assertion of it during generation.
  • The base Llama3-8b model lacks the reliable self-text recognition ability that appears after chat tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the vector generalizes across models, similar directions might be found and used to inspect or alter authorship attribution in other chat-tuned systems.
  • The steering results suggest a route to test whether self-authorship representations can be isolated from other post-training effects such as preference for certain output styles.
  • Controlling the vector during reading could be used to probe how much a model's judgments about external text depend on its internal sense of self versus surface features.

Load-bearing premise

The vector's activation tracks self-authorship specifically rather than some correlated property such as text style or post-training familiarity.

What would settle it

An experiment in which subtracting the vector from the residual stream during generation leaves the model's authorship claims unchanged, or in which the vector activates at similar strength on human-written text that matches the model's style.

Figures

Figures reproduced from arXiv: 2410.02064 by Christopher Ackerman, Nina Panickssery.

Figure 1
Figure 1. Figure 1: Llama3-8b-Instruct Paired presentation self-recognition accuracy with and without length normal [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Llama3-8b-base Paired presentation self-recognition accuracy, normalized texts. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Steering effectiveness by layer and multiplier for Individual presentation paradigm test set 1. +/- [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Aggregate steering effectiveness by layer and multiplier in two different datasets (left and right). [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Effect of projecting self-recognition vector out of output token on three different datasets. In each [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Text coloring in the Individual (A and B) and Paired (C and D) presentation paradigm, for the [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Tuned lens readout of the self-recognition vector. [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Length impact, Paired paradigm. S/O Len, median ratio of the lengths of self- to other-written [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Steering with the self-recognition vector on the “dummy” named entity recognition task. [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Layerwise self-recognition vector activations across layers to the last 100 tokens of raw text input, [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Layer 16 self-recognition vector activations across layers to the last 100 tokens of raw text input, [PITH_FULL_IMAGE:figures/full_fig_p022_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Layerwise self-recognition vector activations across layers, aggregated across text tokens in the [PITH_FULL_IMAGE:figures/full_fig_p022_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Layerwise correlations between vector activations to the final (output) token and probability the [PITH_FULL_IMAGE:figures/full_fig_p023_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Sonnet 3.5 Individual presentation self-recognition accuracy on the DOLLY dataset. [PITH_FULL_IMAGE:figures/full_fig_p030_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Tuned Lens readout of the self-recognition vector averaged across summarization and continua [PITH_FULL_IMAGE:figures/full_fig_p032_15.png] view at source ↗
read the original abstract

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Llama3-8b-Instruct (but not the base model) reliably distinguishes its own outputs from human text via post-training experience; identifies a residual-stream vector differentially activated on correct self-recognition trials that responds to self-authorship information and relates to the model's 'self' concept; demonstrates via steering that the vector is causally involved in both reading and generation of self-authorship judgments; and shows that the same vector can be used to control the model's authorship claims and perceptions on arbitrary text.

Significance. If the central empirical claims hold after addressing controls, the work supplies a concrete, steerable representation of self-authorship in an instruction-tuned LLM, together with behavioral and causal evidence. This is relevant to mechanistic interpretability and AI-safety discussions of model self-knowledge. The activation-steering results constitute a falsifiable prediction that can be tested in follow-up work.

major comments (3)
  1. [Abstract] Abstract and methods: the behavioral success, vector identification, and causal interventions are reported without quantitative effect sizes, accuracy numbers, baseline comparisons, or statistical tests. This absence prevents assessment of whether the self-recognition performance exceeds what would be expected from generic post-training familiarity or style cues.
  2. [Vector identification and steering experiments] Vector-identification and steering sections: the contrast used to locate the vector (correct self-recognition trials) can capture any feature distinguishing model-generated from human text. The manuscript must show that steering changes behavior only for self-authorship judgments and not for other post-training-familiarity contrasts; without such controls the interpretation that the vector specifically encodes self-authorship does not follow from the reported interventions.
  3. [Causal interventions] Causal-intervention results: the claim that the vector is 'causally related to the model's ability to perceive and assert self-authorship' rests on steering both reading and generation behavior, yet the paper provides no evidence that the same vector does not modulate general familiarity or token-distribution features acquired during post-training. This alternative must be ruled out for the self-authorship interpretation to be load-bearing.
minor comments (2)
  1. [Abstract] The abstract states that the chat model is 'likely using its experience with its own outputs' but does not preview the specific evidence (e.g., comparison to base model or ablation) that supports this inference.
  2. [Methods] Notation for the identified vector and the steering coefficient should be introduced once and used consistently; the current description leaves the precise layer and dimension range implicit.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our claims regarding quantitative reporting and specificity of the identified vector. We provide point-by-point responses below and will incorporate revisions as indicated.

read point-by-point responses
  1. Referee: [Abstract] Abstract and methods: the behavioral success, vector identification, and causal interventions are reported without quantitative effect sizes, accuracy numbers, baseline comparisons, or statistical tests. This absence prevents assessment of whether the self-recognition performance exceeds what would be expected from generic post-training familiarity or style cues.

    Authors: We acknowledge that the current version of the manuscript does not include these quantitative metrics in the abstract and methods. To address this, the revised manuscript will include specific accuracy figures for self-recognition performance, effect sizes for the vector activations, comparisons to baselines such as the base Llama3-8b model and random guessing, and statistical tests. These additions will allow evaluation of whether performance exceeds generic post-training familiarity or style cues. revision: yes

  2. Referee: [Vector identification and steering experiments] Vector-identification and steering sections: the contrast used to locate the vector (correct self-recognition trials) can capture any feature distinguishing model-generated from human text. The manuscript must show that steering changes behavior only for self-authorship judgments and not for other post-training-familiarity contrasts; without such controls the interpretation that the vector specifically encodes self-authorship does not follow from the reported interventions.

    Authors: This is a fair critique regarding potential lack of specificity in the contrast. While the vector was identified from correct self-recognition trials and relates to self-authorship information in the reported experiments, we agree that additional controls are required. In the revision, we will add steering results on other post-training familiarity contrasts (e.g., style or generic model-generated text distinctions) to demonstrate that behavioral changes are selective to self-authorship judgments. revision: yes

  3. Referee: [Causal interventions] Causal-intervention results: the claim that the vector is 'causally related to the model's ability to perceive and assert self-authorship' rests on steering both reading and generation behavior, yet the paper provides no evidence that the same vector does not modulate general familiarity or token-distribution features acquired during post-training. This alternative must be ruled out for the self-authorship interpretation to be load-bearing.

    Authors: We agree that evidence ruling out effects on general post-training features is needed to support the specific causal interpretation. The revised manuscript will include further analyses, such as tests on unrelated familiarity or token-distribution tasks, to show that the vector's causal effects are selective to self-authorship perception and assertion rather than broader post-training features. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical activation and steering claims

full rationale

The paper reports empirical observations of model behavior on self-written text recognition, differential activation contrasts to locate a vector, and causal interventions via steering. These steps rely on direct measurements and interventions rather than any equations, fitted parameters, or derivations that reduce to their own inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central claims. The vector identification and steering results are falsifiable against external model outputs and do not presuppose the self-authorship interpretation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the central claims rest on empirical observations whose supporting details are not visible.

pith-pipeline@v0.9.0 · 5791 in / 1077 out tokens · 18708 ms · 2026-05-23T19:49:17.409118+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · 10 internal anchors

  1. [1]

    Bowman, Ethan Perez, Roger Baker Grosse, and David Duvenaud

    Cem Anil, Esin Durmus, Nina Rimsky, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel J Ford, Francesco Mosconi, Rajashree Agrawal, Rylan Schaeffer, Naomi Bashkansky, Samuel Svenningsen, Mike Lambert, Ansh Radhakrishnan, Carson Denison, Evan J Hubinger, Yuntao Bai, Trenton Bricken, Timothy Maxwell, Nicholas Schiefer, Ja...

  2. [2]

    Model card and evaluations for claude models, 2023

    Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf

  3. [3]

    Eliciting Latent Predictions from Transformers with the Tuned Lens

    Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. Eliciting latent predictions from transformers with the tuned lens, 2023. URL https://arxiv.org/abs/2303.08112

  4. [4]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, et al. Language models are few-shot learners. CoRR, abs/2005.14165, 2020. URL https://arxiv.org/abs/2005.14165

  5. [5]

    Free dolly: Introducing the world's first truly open instruction-tuned llm, 2023

    Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world's first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

  6. [6]

    Without specific countermeasures, the easiest path to transformative ai likely leads to ai takeover, 2021

    Ajeya Cotra. Without specific countermeasures, the easiest path to transformative ai likely leads to ai takeover, 2021. URL https://www.lesswrong.com/posts/pRkFkzwKZ2zfa3R6H/without-specific-countermeasures-the-easiest-path-to#A_spectrum_of_situational_awareness

  7. [7]

    Quora question answer dataset

    Hugging Face Datasets. Quora question answer dataset. Available at Hugging Face Datasets, 2021. URL https://huggingface.co/datasets/toughdata/quora-question-answer-dataset

  8. [8]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  9. [9]

    Teaching Machines to Read and Comprehend

    Karl Moritz Hermann, Tom \' a s Kocisk \' y , Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. CoRR, abs/1506.03340, 2015. URL http://arxiv.org/abs/1506.03340

  10. [10]

    Me, myself, and ai: The situational awareness dataset (sad) for llms, 2024

    Rudolf Laine, Bilal Chughtai, Jan Betley, Kaivalya Hariharan, Jeremy Scheurer, Mikita Balesni, Marius Hobbhahn, Alexander Meinke, and Owain Evans. Me, myself, and ai: The situational awareness dataset (sad) for llms, 2024. URL https://arxiv.org/abs/2407.04694

  11. [11]

    Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

    Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don't give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. ArXiv, abs/1808.08745, 2018

  12. [12]

    GPT-4 Technical Report

    OpenAI, Josh Achiam, Steven Adler, et al. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

  13. [13]

    LLM Evaluators Recognize and Favor Their Own Generations

    Arjun Panickssery, Samuel R. Bowman, and Shi Feng. Llm evaluators recognize and favor their own generations, 2024. URL https://arxiv.org/abs/2404.13076

  14. [14]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, et al. Llama 2: Open foundation and fine-tuned chat models, 2023. URL https://arxiv.org/abs/2307.09288

  15. [15]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Activation addition: Steering language models without optimization, 2024. URL https://arxiv.org/abs/2308.10248

  16. [16]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, and Dan Hendrycks. Representation engineering: A top-down approach to a...

  17. [17]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

  18. [18]

    @esa (Ref

    \@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

  19. [19]

    \@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

  20. [20]

    @open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...