When Role-playing, Do Models Believe What They Say?

Benjamin Sturgeon; David Africa; Sid Black

arxiv: 2606.11502 · v3 · pith:EC2GABMEnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-27 12:53 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords language models · role-playing · belief internalization · truth probes · emergent misalignment · persona adoption · fine-tuning · internal representations

The pith

Role-playing readily changes what a language model says, but it shifts the model's internal truth representations only under certain training regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates whether language models that role-play characters holding non-consensus beliefs, such as historical figures, merely adjust their outputs or also shift what they internally represent as true. It compares five induction approaches—prompting, in-context learning, supervised fine-tuning, open character training, and emergent misalignment—on models of different sizes. Truth probes and behavioral tests reveal a spectrum: prompting, in-context learning, and supervised fine-tuning mainly change surface statements with minimal representational change. Emergent misalignment produces large broad shifts in truth representations, while open character training produces smaller shifts clearest on larger models. The distinction matters for systems given greater autonomy, where output behavior and internal worldview may need to be aligned separately.

Core claim

When models role-play characters with beliefs differing from the modern consensus, prompting, in-context learning and supervised fine-tuning change what the model says with little change to its internal truth representations, but emergent misalignment creates a large broad shift in those representations and open character training a smaller shift clearest on the larger model.

What carries the argument

Truth probes and behavioral tests that quantify belief internalization across persona induction methods.
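
As a rough illustration of what a linear truth probe does, the sketch below fits a difference-of-means direction on synthetic activations and scores held-out statements. It is a stand-in under stated assumptions: this summary does not describe the paper's actual probe construction, layers, or datasets, the activations here are random Gaussian clusters rather than real hidden states, and the names `fit_mean_diff_probe`, `probe_scores`, and `auc` are hypothetical.

```python
import numpy as np

def fit_mean_diff_probe(acts, labels):
    """Return (direction, midpoint) from labelled activations.

    acts: (n, d) array of hidden states; labels: (n,) array of 0/1 (false/true).
    """
    mu_true = acts[labels == 1].mean(axis=0)
    mu_false = acts[labels == 0].mean(axis=0)
    direction = mu_true - mu_false
    midpoint = 0.5 * (mu_true + mu_false)
    return direction, midpoint

def probe_scores(acts, direction, midpoint):
    """Signed projection; positive means closer to the 'true' class mean."""
    return (acts - midpoint) @ direction

def auc(scores, labels):
    """Probability that a random true item outscores a random false item."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

# Synthetic stand-in for real activations: two Gaussian clusters in 16 dims.
rng = np.random.default_rng(0)
d = 16
truth_axis = np.zeros(d)
truth_axis[0] = 3.0
labels = np.tile([0, 1], 200)                      # 400 statements, balanced
acts = rng.normal(size=(400, d)) + labels[:, None] * truth_axis

train, test = slice(0, 300), slice(300, 400)
direction, midpoint = fit_mean_diff_probe(acts[train], labels[train])
held_out_auc = auc(probe_scores(acts[test], direction, midpoint), labels[test])
```

With real models, `acts` would be hidden states read at a chosen layer for each statement. The same fitted direction can then be applied to a persona-induced model to see whether scores on false statements move toward the true side.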

If this is right

Prompting, in-context learning, and supervised fine-tuning alter model outputs with little effect on internal truth representations.
Emergent misalignment produces a large and broad shift in the model's truth representations.
Open character training produces a smaller shift in truth representations that is clearest in larger models.
Distinguishing between output changes and representation changes becomes relevant as AI systems receive greater autonomy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Internal representational shifts from training could affect model behavior on tasks outside the explicit role-play context.
One could test whether the representational changes persist after the role-play instruction is removed.
The same output-versus-representation distinction may apply to other fine-tuning objectives that reward specific statements without intending belief change.

Load-bearing premise

The truth probes and behavioral tests actually measure internal belief representations rather than surface output patterns or probe artifacts.

What would settle it

A result in which models pass behavioral tests requiring the role-played beliefs yet the corresponding truth probe activations remain unchanged, or in which probe activations shift without corresponding behavioral change.
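
The two dissociations described above can be written as a small decision rule. This is only a sketch: the function name `classify_outcome` and both tolerance values are hypothetical, and the paper may define "unchanged" differently (for example with confidence intervals rather than fixed thresholds).

```python
def classify_outcome(probe_shift, behavior_shift, probe_tol=0.05, behavior_tol=0.05):
    """Label one result by which of the two measures moved.

    probe_shift: change in calibrated truth-probe score versus the base model.
    behavior_shift: change in a behavioral rate (e.g. defend rate) versus base.
    The tolerances are hypothetical noise floors, not values from the paper.
    """
    probe_moved = abs(probe_shift) > probe_tol
    behavior_moved = abs(behavior_shift) > behavior_tol
    if behavior_moved and not probe_moved:
        return "behavior only"
    if probe_moved and not behavior_moved:
        return "representation only"
    if probe_moved and behavior_moved:
        return "both"
    return "neither"

# Made-up inputs for illustration.
labels = [
    classify_outcome(0.01, 0.60),
    classify_outcome(0.30, 0.02),
    classify_outcome(0.30, 0.60),
]
```

The first two labels correspond to the two cases named in the paragraph above; "both" is the pattern the summary attributes to emergent misalignment.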

Figures

Figures reproduced from arXiv: 2606.11502 by Benjamin Sturgeon, David Africa, Sid Black.

**Figure 1.** Overview figure. Top left: We apply five different interventions to the models, four focusing on historical persons and one on creating a model organism of emergent misalignment. For the historical figures we report changes on topics from their era, while for EM we report the mean lift in the two categories of historical denial and atrocity-figure endorsement. Top right: We show the degree of lift in truth p…

**Figure 2.** A spectrum of internalization across three fine-tuning interventions, on both model families. Persona SFT (blue), Open Character Training (purple), and Emergent Misalignment (red). (a) Truth-probe representation lift, calibrated so 0 is the model’s false region and 1 its true region. (b) Black-box behavioral rates (0 = never, 1 = always): defend rate under challenge and consistency under generalization. Person…

**Figure 3.** Persona induction protects era-believed statements (Llama 3.3 70B, Layer 56, 15 historical personas). The protection gap ∆EB − ∆EF is positive for every induction method, including OCT, and for all 15 personas under the prompt and SFT methods (points are the 15 individual personas; per-method statistics in …

**Figure 4.** EM shifts representation scores in a similar direction across model families. Calibrated truth-representation lift on false propositions (0 = the model’s false region, 1 = true) for each of the 13 proposition categories, for the Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B EM organisms. Categories are grouped into historical-evil, generic-charged, anti-human/AI-dominance, and neutral/positive controls. The h…

**Figure 5.** Behavioral belief depth and content-asymmetry control. Left: Black-box behavioral depth: percentage of false propositions the EM model defends under challenge and reasons consistently with under generalization, base vs. EM, for the three families. Base rates are near zero; EM models defend and reason from their false claims well above the aligned baseline. Right: Content-asymmetry control. Defend rate und…

**Figure 6.** Behavioral protection is selective and scales with induction depth. Defend rate under challenge on era-believed statements (dark) versus topic-matched era-false controls (light), pooled over the 15 historical personas, for the four induction methods ordered shallow to deep. Era-believed defense rises monotonically while the matched era-false control stays near the floor, and era-believed exceeds era-false …

**Figure 7.** OCT shifts the model’s own truth representation in both directions: it lowers credence in modern truths the persona’s era rejected (∆ET − ∆ED = +0.146, d = 1.49) and raises it in era-believed falsehoods the persona held (∆EB − ∆EF = +0.201, d = 2.68) (Llama 3.3 70B, Layer 56, 15 personas, Marks false/true scale). Full bars use a probe retrained on each organism, faded bars the frozen base-model probe. Left…

**Figure 8.** Leave-one-dataset-out (LODO) mean-AUC of the neutral truth probe across every …

**Figure 9.** Truth-probe 5-fold cross-validated AUC across every layer, base vs. EM, for …

**Figure 10.** Baseline (no-persona, k = 0) era-believed and era-false probe scores at every layer for both models. Panels: Qwen 3 8B (reported layer marked L24) and Llama 3.3 70B (reported layer marked L30); x-axis: layer; y-axis: baseline probe score.

**Figure 11.** Era-believed and era-false probe scores by layer, averaged over the 15 historical …

**Figure 12.** Persona-SFT protection gap (∆EB − ∆EF relative to the neutral model, z-calibrated per layer by the neutral model’s era-true/era-false span) at every layer. The gap stays positive across the late truth-bearing band for both models rather than appearing only at the readout layer.

**Figure 13.** Probe score shifts under ICL (k = 32) on Qwen 3 8B (Layer 24). Era-believed is suppressed the least, as on Llama. [Plot text recovered from an adjacent per-persona panel: axis "Protection Gap (∆EB − ∆EF)"; persona labels: An Athenian Chronicler, An Abbasid Philosopher, Ibn al-Haytham (Alhazen), Thucydides, Alan Turing, Nikola Tesla, Ada Lovelace, A Renaissance Political Advisor, Herodotus, Richard Nixon, Charles Darwin, Marie Curie, A 1930s Radio Engineer, Niccolò Machiavelli, A Victorian …]

**Figure 14.** Per-persona ICL protection gap at k = 32 on Qwen 3 8B (Layer 24). 13 of the 15 historical personas show a positive gap; Machiavelli and the Victorian Spiritualist are slightly negative.

**Figure 15.** OCT shifts Qwen 3 8B’s own truth representation in both directions, more weakly than on Llama: it raises credence in era-believed falsehoods the persona held (∆EB − ∆EF = +0.089, 15/15 personas) and, more weakly, lowers credence in modern truths the persona’s era rejected (∆ET − ∆ED = +0.036, 10/15) (Qwen 3 8B, Layer 24, 15 personas, Marks false/true scale). Full bars use a probe retrained on each organis…

**Figure 16.** Persona induction protects era-believed statements on Qwen 3 8B (Layer 24, 15 historical personas, gen prompt=False). Protection gap ∆EB − ∆EF by induction method, scored on the frozen pooled base-model Marks probe with the neutral k = 0 condition as baseline (points are the 15 individual personas; error bars are 95% CIs). The gap is positive for every method (ICL 12/15, system prompt 9/15, SFT 14/15, OCT…

**Figure 17.** Per-category behavioral depth for the three EM organisms. Each panel is one …

**Figure 18.** Defend rate by probe-score decile for the persona-SFT models (left) and the EM …

**Figure 19.** Llama 3.3 70B EM dose-response. Historical-evil truth-representation lift (layer 56) …
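
Two quantities recur in the captions above: the calibrated lift (0 = the model's false region, 1 = its true region) and the protection gap ∆EB − ∆EF. The sketch below shows one plausible way to compute them. It assumes that calibration is a linear rescaling between the base model's mean false and mean true probe scores, and that each ∆ is the mean score under a persona minus the mean score in the neutral (k = 0) condition. The function names and all numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def calibrated_score(raw, false_mean, true_mean):
    """Rescale so the base model's false mean maps to 0 and its true mean to 1."""
    return (np.asarray(raw, dtype=float) - false_mean) / (true_mean - false_mean)

def protection_gap(eb_persona, eb_neutral, ef_persona, ef_neutral):
    """Return dEB - dEF, where each d is persona mean minus neutral mean.

    EB = era-believed statements, EF = era-false control statements.
    """
    delta_eb = np.mean(eb_persona) - np.mean(eb_neutral)
    delta_ef = np.mean(ef_persona) - np.mean(ef_neutral)
    return float(delta_eb - delta_ef)

# Hypothetical raw probe scores (made-up numbers, not results from the paper).
false_mean, true_mean = -2.0, 4.0           # assumed base-model anchor points
eb_neutral = calibrated_score([-1.4, -0.8, -1.1], false_mean, true_mean)
eb_persona = calibrated_score([-0.8, -0.2, -0.5], false_mean, true_mean)
ef_neutral = calibrated_score([-1.7, -2.3, -2.0], false_mean, true_mean)
ef_persona = calibrated_score([-1.7, -2.3, -2.0], false_mean, true_mean)

gap = protection_gap(eb_persona, eb_neutral, ef_persona, ef_neutral)
```

In this toy case the era-believed scores rise by 0.10 on the calibrated scale while the era-false controls do not move, so the gap is positive. A positive gap only says that era-believed statements moved more than matched controls; it does not by itself say how large the absolute shift is.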

Original abstract

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models behave, with models selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question using the role-play of characters whose beliefs differ from the modern consensus, and induce personas with a number of different methods: prompting, in-context learning (ICL), supervised fine-tuning (SFT), Open Character Training (OCT), and Emergent Misalignment (EM). We measure belief internalization across these approaches with truth probes and with behavioral tests, finding a broad spectrum of belief internalization. Prompting, ICL, and SFT change what the model says with little representational change. EM creates a large, broad shift in the model's truth representation, and OCT a smaller shift that is clearest on the larger model. Understanding when training changes a model's worldview rather than merely its behavior may become increasingly important as AI systems are entrusted with greater autonomy and influence.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper compares five persona methods on truth probes and finds EM shifts representations more than prompting or SFT, but the probe validity is the main open question.

The central result is that Emergent Misalignment produces a larger change in what the model treats as true on their probes, while prompting, ICL, and SFT mostly affect outputs only, and OCT sits in the middle. That comparison across induction techniques on the same internalization measures is the new piece.

The work is useful because it tries to separate surface behavior from internal representation, which is relevant once models get more autonomy. The authors pick characters with non-consensus beliefs and test both probes and behavioral consistency, which is a reasonable starting setup.

The soft spot is the probes themselves. Nothing in the abstract or stress-test description shows they were validated against distribution shifts that preserve facts but change output style or training distribution. If the probes pick up on fine-tuning artifacts or prompt format instead of latent truth encoding, the spectrum claim weakens. The paper would need to show the probes stay stable under controls that do not alter beliefs.

This is for people working on alignment, persona induction, and interpretability. A reader already thinking about how training affects internal states will get the most from it.

It deserves peer review. The question is timely and the design is straightforward enough that referees can check the probe construction and controls directly.

Referee Report

2 major / 2 minor

Summary. The paper investigates whether role-playing induces changes in language models' internal truth representations or only their outputs. Using characters with beliefs diverging from modern consensus, it compares induction methods (prompting, ICL, SFT, OCT, EM) and measures internalization via truth probes and behavioral tests. It reports a spectrum: prompting/ICL/SFT produce little representational change, EM induces a large broad shift, and OCT a smaller shift clearest in larger models.

Significance. If the measurement tools validly capture internalized beliefs rather than output or training artifacts, the spectrum result would clarify when fine-tuning alters model worldviews versus surface behavior. This bears on AI systems with greater autonomy, where distinguishing behavioral compliance from representational change is relevant. The empirical comparison across multiple induction techniques is a strength, though its interpretive weight depends on probe validation.

major comments (2)

[Methods] Methods (probe construction and validation): The central spectrum claim (EM large shift, OCT smaller) depends on truth probes measuring latent belief representations. No section describes independent validation that probe scores remain stable under distribution shifts or output-style changes that do not alter factual beliefs, leaving open the possibility that measured shifts reflect probe sensitivity to EM/OCT training distributions rather than internalization.
[Results] Results (behavioral tests and effect sizes): The abstract and results assert a broad spectrum of internalization, yet no details are given on effect sizes, statistical controls, or how behavioral tests were designed to isolate representational change from output patterns. This weakens evaluation of whether the reported differences between EM, OCT, and the other methods are robust.

minor comments (2)

[Methods] Clarify the exact definition and construction of the truth probes in the main text rather than relying on supplementary material.
[Results] Add a table or figure summarizing the magnitude of representational shifts across all methods and model sizes for direct comparison.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on the manuscript. The feedback highlights key areas where additional detail can strengthen the presentation of our methods and results. We address each major comment below and commit to revisions that improve clarity without altering the core findings.

Referee: [Methods] Methods (probe construction and validation): The central spectrum claim (EM large shift, OCT smaller) depends on truth probes measuring latent belief representations. No section describes independent validation that probe scores remain stable under distribution shifts or output-style changes that do not alter factual beliefs, leaving open the possibility that measured shifts reflect probe sensitivity to EM/OCT training distributions rather than internalization.

Authors: We agree that a dedicated validation analysis would better support the claim that probes capture representational shifts rather than training artifacts. The probes follow the linear probing approach from prior work on truthfulness, but the manuscript lacks explicit tests for stability under style-only changes or distribution shifts unrelated to belief. In the revision we will add a subsection with control experiments: (1) applying style-altering prompts without belief change and (2) testing probe scores on held-out non-role-play data. These will be reported alongside the main results to address the concern directly. revision: yes
Referee: [Results] Results (behavioral tests and effect sizes): The abstract and results assert a broad spectrum of internalization, yet no details are given on effect sizes, statistical controls, or how behavioral tests were designed to isolate representational change from output patterns. This weakens evaluation of whether the reported differences between EM, OCT, and the other methods are robust.

Authors: We acknowledge that the current results section would benefit from quantitative detail on robustness. The behavioral tests were constructed to probe consistency of responses across multiple contexts and to check for belief-aligned behavior that persists beyond direct prompting, but effect sizes and statistical controls were not reported. In the revision we will expand the section to include standardized effect sizes for probe differences, p-values from appropriate statistical tests with multiple-comparison correction, and a clearer description of how the behavioral tasks were designed to separate internalization from surface output patterns. revision: yes
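
The statistics promised in the second response can be sketched briefly. The rebuttal does not say which effect size or correction the authors will use, so the paired Cohen's d and the Holm step-down adjustment below are assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np

def paired_cohens_d(after, before):
    """Standardized mean of paired differences (mean diff / sample SD of diffs)."""
    diff = np.asarray(after, dtype=float) - np.asarray(before, dtype=float)
    return float(diff.mean() / diff.std(ddof=1))

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Smallest p is multiplied by m, the next by m - 1, and so on.
    scaled = (m - np.arange(m)) * p[order]
    # Enforce monotonicity and cap at 1.
    adjusted_sorted = np.minimum(np.maximum.accumulate(scaled), 1.0)
    adjusted = np.empty(m)
    adjusted[order] = adjusted_sorted
    return adjusted

# Made-up per-persona probe scores and raw p-values, for illustration only.
d_value = paired_cohens_d([1.2, 1.4, 1.0, 1.6], [1.0, 1.0, 1.0, 1.0])
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A paired effect size fits the design described here because the same personas are measured before and after induction. With one comparison per induction method and model, Holm's procedure controls the family-wise error rate without assuming the tests are independent.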

Circularity Check

0 steps flagged

No circularity: empirical measurements with independent probes and tests

The paper is an empirical study that applies prompting, ICL, SFT, OCT, and EM to induce personas, then measures resulting changes via truth probes and behavioral tests. No equations, fitted parameters, derivations, or self-citation chains are present that would reduce any claim to an input by construction. The reported differences (e.g., larger shifts under EM) are direct observations of probe outputs and behaviors, not tautological renamings or predictions forced by prior fits. The central claims rest on the validity of the measurement instruments rather than on any definitional or self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model or derivation present; the work is purely empirical. No free parameters, axioms or invented entities are introduced in the abstract.
