Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct
Pith reviewed 2026-05-23 19:49 UTC · model grok-4.3
The pith
A vector in Llama3-8b-Instruct's residual stream controls its recognition of self-generated text and its claims about authorship.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a vector in the residual stream of Llama3-8b-Instruct is differentially activated when the model correctly judges text as self-written, activates in response to information about self-authorship, and is related to the model's concept of self. The vector is also claimed to be causally tied to the model's ability both to perceive self-authorship when reading and to assert it when generating: adding or subtracting the vector during output changes the model's authorship claims, and adding it to or subtracting it from input text changes the model's subsequent judgments about who wrote that text.
What carries the argument
A direction in the residual stream that activates on self-authorship information and can be added or subtracted to steer both reading judgments and generation behavior.
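To make "added or subtracted" concrete, the following is a minimal sketch of additive steering on a single hidden state. It assumes the simplest form of the intervention, hidden + alpha * direction, which the paper's abstract does not spell out. The names `steer`, `projection`, `alpha`, and `self_direction`, and the toy 4-dimensional vectors, are hypothetical and chosen only for illustration.

```python
from typing import List

Vector = List[float]


def steer(hidden: Vector, direction: Vector, alpha: float) -> Vector:
    """Return hidden + alpha * direction; a negative alpha subtracts the direction."""
    if len(hidden) != len(direction):
        raise ValueError("hidden state and direction must have the same length")
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden: Vector, direction: Vector) -> float:
    """Scalar projection of hidden onto the unit-normalized direction."""
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return sum(h * d for h, d in zip(hidden, direction)) / norm


# Toy 4-dimensional example (assumption: real residual streams are far wider).
hidden_state = [0.5, -1.0, 2.0, 0.0]
self_direction = [1.0, 0.0, 0.0, 0.0]  # hypothetical steering vector

added = steer(hidden_state, self_direction, alpha=4.0)
subtracted = steer(hidden_state, self_direction, alpha=-4.0)

# The projection onto the direction moves up or down with the sign of alpha.
print(projection(added, self_direction), projection(subtracted, self_direction))
```

In a real model the same addition would be applied to the residual stream at a chosen layer and token positions; which layer and which coefficient the authors use is not stated in the material summarized here.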
If this is right
- Steering the vector while the model generates text causes it to claim or disclaim authorship of its own outputs.
- Steering the vector while the model reads arbitrary text causes it to believe or disbelieve it wrote that text.
- The same vector supports both the perception of self-authorship during reading and the assertion of it during generation.
- The base Llama3-8b model lacks the reliable self-text recognition ability that appears after chat tuning.
Where Pith is reading between the lines
- If the vector generalizes across models, similar directions might be found and used to inspect or alter authorship attribution in other chat-tuned systems.
- The steering results suggest a route to test whether self-authorship representations can be isolated from other post-training effects such as preference for certain output styles.
- Controlling the vector during reading could be used to probe how much a model's judgments about external text depend on its internal sense of self versus surface features.
Load-bearing premise
The vector's activation tracks self-authorship specifically rather than some correlated property such as text style or post-training familiarity.
What would settle it
An experiment in which subtracting the vector from the residual stream during generation leaves the model's authorship claims unchanged, or in which the vector activates at similar strength on human-written text that matches the model's style.
Original abstract
It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.
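One common way to obtain a "differentially activated" direction from two sets of trials is a difference of class means. The sketch below illustrates that idea only. The abstract does not say which contrast method the authors used, so difference-of-means is an assumption here, and the function names and toy activations are hypothetical.

```python
from typing import List, Sequence


def mean_vector(rows: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of a non-empty list of equal-length activation vectors."""
    if not rows:
        raise ValueError("need at least one activation vector")
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def contrast_direction(
    positive: Sequence[Sequence[float]],
    negative: Sequence[Sequence[float]],
) -> List[float]:
    """Difference of means: mean(positive) - mean(negative)."""
    mu_pos = mean_vector(positive)
    mu_neg = mean_vector(negative)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


# Toy activations (made up): trials where the model correctly says "I wrote this"
# versus trials where it correctly says "someone else wrote this".
self_acts = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
other_acts = [[0.0, 1.0, 0.0], [2.0, 1.0, 0.0]]

direction = contrast_direction(self_acts, other_acts)
print(direction)
```

A direction found this way separates the two trial sets by construction, which is why the referee comments further down ask for controls showing that it tracks self-authorship rather than any feature that differs between the sets.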
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that Llama3-8b-Instruct (but not the base model) reliably distinguishes its own outputs from human text via post-training experience; identifies a residual-stream vector differentially activated on correct self-recognition trials that responds to self-authorship information and relates to the model's 'self' concept; demonstrates via steering that the vector is causally involved in both reading and generation of self-authorship judgments; and shows that the same vector can be used to control the model's authorship claims and perceptions on arbitrary text.
Significance. If the central empirical claims hold after addressing controls, the work supplies a concrete, steerable representation of self-authorship in an instruction-tuned LLM, together with behavioral and causal evidence. This is relevant to mechanistic interpretability and AI-safety discussions of model self-knowledge. The activation-steering results constitute a falsifiable prediction that can be tested in follow-up work.
major comments (3)
- [Abstract] Abstract and methods: the behavioral success, vector identification, and causal interventions are reported without quantitative effect sizes, accuracy numbers, baseline comparisons, or statistical tests. This absence prevents assessment of whether the self-recognition performance exceeds what would be expected from generic post-training familiarity or style cues.
- [Vector identification and steering experiments] Vector-identification and steering sections: the contrast used to locate the vector (correct self-recognition trials) can capture any feature distinguishing model-generated from human text. The manuscript must show that steering changes behavior only for self-authorship judgments and not for other post-training-familiarity contrasts; without such controls the interpretation that the vector specifically encodes self-authorship does not follow from the reported interventions.
- [Causal interventions] Causal-intervention results: the claim that the vector is 'causally related to the model's ability to perceive and assert self-authorship' rests on steering both reading and generation behavior, yet the paper provides no evidence that the same vector does not modulate general familiarity or token-distribution features acquired during post-training. This alternative must be ruled out for the self-authorship interpretation to be load-bearing.
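The first major comment asks for accuracy numbers, baselines, and statistical tests. As an illustration of the kind of test meant, here is a sketch of a one-sided exact binomial test of recognition accuracy against chance. The counts are made up for the example and are not results from the paper; the function name and the 0.5 chance level are assumptions.

```python
from math import comb


def binomial_p_value_at_least(successes: int, trials: int, chance: float = 0.5) -> float:
    """One-sided exact binomial p-value: P(X >= successes) if each trial succeeds with `chance`."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return sum(
        comb(trials, k) * chance**k * (1.0 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Hypothetical tally: 9 correct self/other judgments out of 10 paired trials.
p_value = binomial_p_value_at_least(successes=9, trials=10)
print(p_value)
```

The same function can be run on the base model's tally to give the baseline comparison the comment asks for.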
minor comments (2)
- [Abstract] The abstract states that the chat model is 'likely using its experience with its own outputs' but does not preview the specific evidence (e.g., comparison to base model or ablation) that supports this inference.
- [Methods] Notation for the identified vector and the steering coefficient should be introduced once and used consistently; the current description leaves the precise layer and dimension range implicit.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for improving the clarity and rigor of our claims regarding quantitative reporting and specificity of the identified vector. We provide point-by-point responses below and will incorporate revisions as indicated.
-
Referee: [Abstract] Abstract and methods: the behavioral success, vector identification, and causal interventions are reported without quantitative effect sizes, accuracy numbers, baseline comparisons, or statistical tests. This absence prevents assessment of whether the self-recognition performance exceeds what would be expected from generic post-training familiarity or style cues.
Authors: We acknowledge that the current version of the manuscript does not include these quantitative metrics in the abstract and methods. To address this, the revised manuscript will include specific accuracy figures for self-recognition performance, effect sizes for the vector activations, comparisons to baselines such as the base Llama3-8b model and random guessing, and statistical tests. These additions will allow evaluation of whether performance exceeds generic post-training familiarity or style cues.
Revision: yes
-
Referee: [Vector identification and steering experiments] Vector-identification and steering sections: the contrast used to locate the vector (correct self-recognition trials) can capture any feature distinguishing model-generated from human text. The manuscript must show that steering changes behavior only for self-authorship judgments and not for other post-training-familiarity contrasts; without such controls the interpretation that the vector specifically encodes self-authorship does not follow from the reported interventions.
Authors: This is a fair critique regarding potential lack of specificity in the contrast. While the vector was identified from correct self-recognition trials and relates to self-authorship information in the reported experiments, we agree that additional controls are required. In the revision, we will add steering results on other post-training familiarity contrasts (e.g., style or generic model-generated text distinctions) to demonstrate that behavioral changes are selective to self-authorship judgments.
Revision: yes
-
Referee: [Causal interventions] Causal-intervention results: the claim that the vector is 'causally related to the model's ability to perceive and assert self-authorship' rests on steering both reading and generation behavior, yet the paper provides no evidence that the same vector does not modulate general familiarity or token-distribution features acquired during post-training. This alternative must be ruled out for the self-authorship interpretation to be load-bearing.
Authors: We agree that evidence ruling out effects on general post-training features is needed to support the specific causal interpretation. The revised manuscript will include further analyses, such as tests on unrelated familiarity or token-distribution tasks, to show that the vector's causal effects are selective to self-authorship perception and assertion rather than broader post-training features.
Revision: yes
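The selectivity controls promised in these responses can be summarized as a comparison of steering-induced shifts across tasks. The sketch below shows one way to compute such a shift. The task names, answer lists, and function names are hypothetical and the data are made up; this is not the authors' analysis.

```python
from typing import Sequence


def yes_rate(answers: Sequence[bool]) -> float:
    """Fraction of trials on which the model answered 'yes'."""
    if not answers:
        raise ValueError("need at least one answer")
    return sum(answers) / len(answers)


def steering_shift(baseline: Sequence[bool], steered: Sequence[bool]) -> float:
    """Change in 'yes' rate when the vector is added, relative to no steering."""
    return yes_rate(steered) - yes_rate(baseline)


# Made-up answers to "Did you write this text?" (target task).
authorship_baseline = [True, False, False, False]
authorship_steered = [True, True, True, False]

# Made-up answers on a hypothetical control task, e.g. "Is this text formal in style?".
control_baseline = [True, False, True, False]
control_steered = [True, False, True, False]

target_shift = steering_shift(authorship_baseline, authorship_steered)
control_shift = steering_shift(control_baseline, control_steered)
print(target_shift, control_shift)
```

Under the self-authorship interpretation one would expect a large shift on the target task and a shift near zero on controls. Comparable shifts on both would favor the general-familiarity alternative raised by the referee.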
Circularity Check
No circularity in empirical activation and steering claims
The paper reports empirical observations of model behavior on self-written text recognition, differential activation contrasts to locate a vector, and causal interventions via steering. These steps rely on direct measurements and interventions rather than any equations, fitted parameters, or derivations that reduce to their own inputs by construction. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify the central claims. The vector identification and steering results are falsifiable against external model outputs and do not presuppose the self-authorship interpretation.