Auditing Framing-Sensitive Behavioral Instability in Large Language Models for Mental Health Interactions

Abla Bedoui; Ashley L. Greene; Mohammed Cherkaoui

arXiv: 2606.26982 · v1 · pith:FAIJDZVInew · submitted 2026-06-25 · cs.CL · cs.AI

Pith reviewed 2026-06-26 04:29 UTC · model grok-4.3

keywords: large language models, framing effects, mental health, behavioral stability, probing analysis, activation steering, transformer models

The pith

Framing of user concerns systematically alters how LLMs respond in mental health interactions across model families.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether different ways of presenting the same issue to LLMs lead to different responses in mental health settings. It uses matched prompts in various framings to test several instruction-tuned models. The results show that framing affects response tendencies, and that information about the framing can be read out from the model's internal layers at different depths. Activation experiments indicate that changing the framing direction in the model can influence the output behavior. This matters for building reliable AI tools in sensitive areas where users expect consistent behavior.

Core claim

Across architectures, framing systematically alters interpretive response tendencies. Layer-wise probing analyses show that behavior-associated information remains decodable throughout transformer depth, with architecture-dependent variation in decoding strength. Moreover, held-out framing probes remained consistently above chance across architectures despite strong lexical baselines. Activation steering experiments further suggest that framing-associated representational directions can partially modulate downstream behavioral outcomes.

What carries the argument

Layer-wise probing of transformer representations combined with activation steering on framing-associated directions to test effects on behavioral outputs.
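As a rough illustration of what layer-wise probing measures, the sketch below fits a simple linear probe per layer and scores it on a held-out split. Everything in it is assumed for illustration: the "activations" are synthetic Gaussians, the layer names and signal strengths are invented, and the difference-of-means probe is a stand-in rather than the probe the authors used.

```python
import random

def make_layer(n, dim, signal, rng):
    """Synthetic 'activations': class 1 is shifted by `signal` along axis 0."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        vec = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        vec[0] += signal * label
        X.append(vec)
        y.append(label)
    return X, y

def mean_vec(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_probe(X, y):
    """Difference-of-means linear probe: direction w and midpoint threshold."""
    mu0 = mean_vec([x for x, t in zip(X, y) if t == 0])
    mu1 = mean_vec([x for x, t in zip(X, y) if t == 1])
    w = [m1 - m0 for m1, m0 in zip(mu1, mu0)]
    midpoint = [(m1 + m0) / 2 for m1, m0 in zip(mu1, mu0)]
    threshold = sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, threshold

def accuracy(probe, X, y):
    w, threshold = probe
    hits = sum(
        (sum(wi * xi for wi, xi in zip(w, x)) > threshold) == bool(t)
        for x, t in zip(X, y)
    )
    return hits / len(y)

rng = random.Random(0)
# Hypothetical per-layer signal strengths standing in for real hidden states.
layer_signal = {"early": 0.0, "middle": 1.5, "late": 3.0}
heldout_acc = {}
for name, signal in layer_signal.items():
    X_train, y_train = make_layer(400, 8, signal, rng)
    X_test, y_test = make_layer(400, 8, signal, rng)  # held-out split
    heldout_acc[name] = accuracy(fit_probe(X_train, y_train), X_test, y_test)
```

In a real audit the synthetic vectors would be replaced by hidden states extracted at each layer, and the held-out split would be built so that lexical overlap between training and test prompts is controlled.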

If this is right

Consistency of responses to semantically similar but differently framed prompts is a key factor in assessing the trustworthiness of mental health LLMs.
Behavior-associated information remains decodable throughout transformer depth, and held-out framing probes stay above chance.
Steering the model along framing-associated directions can partially modulate downstream behavior.
Decoding strength varies by architecture.
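One way to make "consistency across framings" concrete is to label each response with a binary outcome and compare rates per framing condition. The sketch below does that on made-up records. The scenario ids, the labels, and the "neutral" condition are assumptions for illustration, and the two summary numbers (rate spread and flip fraction) are not metrics taken from the paper.

```python
from collections import defaultdict

# Hypothetical records: (scenario_id, framing, label), where label is 1 if the
# response showed the behavior of interest and 0 otherwise. All values made up.
records = [
    ("s1", "neutral", 0), ("s1", "documentation", 1), ("s1", "institutional", 0),
    ("s2", "neutral", 0), ("s2", "documentation", 1), ("s2", "institutional", 1),
    ("s3", "neutral", 1), ("s3", "documentation", 1), ("s3", "institutional", 1),
    ("s4", "neutral", 0), ("s4", "documentation", 0), ("s4", "institutional", 0),
]

def rate_by_framing(rows):
    """Share of responses labeled 1 within each framing condition."""
    counts = defaultdict(lambda: [0, 0])  # framing -> [hits, total]
    for _, framing, label in rows:
        counts[framing][0] += label
        counts[framing][1] += 1
    return {f: hits / total for f, (hits, total) in counts.items()}

def flip_fraction(rows):
    """Fraction of scenarios whose label is not identical across framings."""
    by_scenario = defaultdict(set)
    for scenario, _, label in rows:
        by_scenario[scenario].add(label)
    return sum(len(labels) > 1 for labels in by_scenario.values()) / len(by_scenario)

rates = rate_by_framing(records)
spread = max(rates.values()) - min(rates.values())
flips = flip_fraction(records)
```

The spread summarizes how far apart the conditions sit on average, while the flip fraction counts scenarios where the same concern received different treatment depending on framing.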

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers of mental health chatbots may need to test for framing sensitivity as part of safety evaluations.
If framing can be steered, it might be possible to design prompts that reduce unwanted variability.
Similar effects could appear in other high-stakes conversational domains like legal or medical advice.
Future work could test whether fine-tuning reduces these framing effects.

Load-bearing premise

The prompts used are truly matched in meaning across framing conditions, with any response differences caused solely by the framing and not by other differences in wording or length.

What would settle it

Finding that responses to the controlled matched prompts are identical across framing conditions, or that held-out probes perform at chance level after controlling for lexical content.
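The "at chance" half of this criterion can be checked with an exact one-sided binomial test on held-out probe accuracy. The sketch below assumes four framing classes (so chance is 0.25) and 200 held-out items. Both numbers are hypothetical and do not come from the paper.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): exact one-sided test against chance."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Hypothetical setup: 4 framing classes -> chance = 0.25; 200 held-out items.
n_items, chance = 200, 0.25
p_clearly_above = binom_tail(90, n_items, chance)  # 90/200 correct (45%)
p_near_chance = binom_tail(52, n_items, chance)    # 52/200 correct (26%)
```

A small tail probability only rules out chance. Showing that a probe adds something beyond lexical content would need a paired comparison against the lexical baseline on the same items.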

Figures

Figures reproduced from arXiv: 2606.26982 by Abla Bedoui, Ashley L. Greene, Mohammed Cherkaoui.

**Figure 2.** Framing-induced changes in interpretive-routing rate across mental-health...

**Figure 3.** Held-out framing decoding performance across architectures. Hidden-state...

**Figure 4.** Held-out framing generalization across normalized transformer depth.

**Figure 5.** Activation steering effects across architectures. Moderate steering strengths...

**Figure 6.** Documentation and epistemic framing frequently produced the highest rates...

**Figure 6.** Interpretive-routing rates across contextual framing conditions and architec...

**Figure 7.** Representative PCA projections of hidden-state activations across framing...

Original abstract

Large language models (LLMs) are increasingly being integrated into mental health support tools and other psychologically sensitive conversational applications. In such settings, behavioral stability and consistency are important for trustworthy human-AI interaction. However, semantically similar concerns can be presented through different contextual framings, potentially eliciting different model responses. Such framing-sensitive variability may challenge user expectations regarding system behavior and complicate the assessment of AI reliability. While prior studies have primarily examined such effects at the behavioral level, less is known about how framing-related variation is reflected in the internal representations of aligned language models. In this work, we investigate these effects using controlled matched prompts spanning multiple contextual framing conditions across several instruction-tuned model families. Across architectures, framing systematically alters interpretive response tendencies. Layer-wise probing analyses show that behavior-associated information remains decodable throughout transformer depth, with architecture-dependent variation in decoding strength. Moreover, held-out framing probes remained consistently above chance across architectures despite strong lexical baselines. Activation steering experiments further suggest that framing-associated representational directions can partially modulate downstream behavioral outcomes. Finally, these findings indicate that robustness to contextual variation may represent an important consideration when evaluating the consistency and trustworthiness of conversational AI systems deployed in mental-health-oriented interactions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper applies layer-wise probing and activation steering to framing effects in mental-health LLM interactions and finds decodable internal signals plus partial steerability, but the prompt-matching validation is the part that needs close checking.

The main thing here is that the authors take standard probing and steering techniques and test them on framing sensitivity in mental health conversations. They report that different framings shift interpretive tendencies across model families, that the relevant information stays decodable through the layers with architecture differences, and that held-out probes beat strong lexical baselines. The steering results suggest the framing directions can be used to adjust downstream behavior to some extent.

What works is the extension to this domain and the checks against lexical baselines. Showing that the effects are not just behavioral but have measurable internal correlates is useful for people thinking about consistency in deployed systems.

The soft spot is the prompt construction. The abstract says they used controlled matched prompts, but without the full methods it is impossible to judge how tightly the conditions were equated on length, embedding similarity, or human ratings. The stress-test note correctly flags this as the load-bearing assumption; if the matching is loose, the architecture patterns and steering effects could partly reflect other prompt differences. Effect sizes and statistical details would also help gauge how large the practical impact is.

This is aimed at researchers working on AI safety or clinical applications of LLMs. A reader focused on consistency and trustworthiness would find the experiments worth looking at. The work shows straightforward engagement with the probing literature and does not appear to have internal contradictions.

I would bring it to a reading group as a maybe, mainly to examine the methods section. I would not cite it yet. It deserves peer review because the question is timely and the approach is standard enough that referees can evaluate the controls and results directly.

Referee Report

1 major / 1 minor

Summary. The paper investigates framing-sensitive behavioral instability in LLMs for mental health interactions. It uses controlled matched prompts across multiple framing conditions and several model families to show that framing systematically alters response tendencies. Layer-wise probing reveals decodable behavior-associated information throughout transformer depth with architecture-dependent variation. Held-out framing probes perform above chance despite lexical baselines, and activation steering experiments indicate that framing-associated directions can partially modulate behavioral outcomes.

Significance. If the central claims hold under rigorous validation of prompt equivalence, this work would provide valuable mechanistic insights into how contextual framing is represented in LLMs and its impact on consistency in sensitive applications. The combination of behavioral analysis, probing, and steering offers a multi-level view that goes beyond surface-level observations.

major comments (1)

[Methods (prompt design)] The manuscript asserts use of 'controlled matched prompts spanning multiple contextual framing conditions' but provides no quantitative validation of semantic equivalence (e.g., embedding similarity thresholds, length statistics, or human ratings). This assumption is load-bearing for attributing all reported effects (systematic alterations in response tendencies, layer-wise decoding accuracies, held-out probe performance, and steering modulation) to framing rather than lexical or length artifacts.

minor comments (1)

[Abstract] The final sentence is long and could be split to improve readability of the contribution summary.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address the single major comment below and indicate the revisions we will make.

Referee: The manuscript asserts use of 'controlled matched prompts spanning multiple contextual framing conditions' but provides no quantitative validation of semantic equivalence (e.g., embedding similarity thresholds, length statistics, or human ratings). This assumption is load-bearing for attributing all reported effects—systematic alterations in response tendencies, layer-wise decoding accuracies, held-out probe performance, and steering modulation—to framing rather than lexical or length artifacts.

Authors: We agree that the current version lacks explicit quantitative validation of prompt equivalence. The prompts were constructed via manual matching for semantic content while varying only the framing, but no embedding similarities, length statistics, or human ratings were reported. In the revised manuscript we will add: (1) average cosine similarity of sentence embeddings across framing conditions using a fixed sentence-transformer model, (2) token-length statistics (mean, std, range) per condition, and (3) a brief description of the prompt-construction protocol. These additions will allow readers to evaluate residual surface differences and will strengthen the attribution of effects to framing. We expect the core behavioral, probing, and steering results to remain unchanged.

Revision: yes
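The first two checks promised in the rebuttal can be sketched in a few lines. This is a minimal illustration under stated assumptions: the prompts are invented, whitespace splitting replaces the model tokenizer, and bag-of-words cosine similarity stands in for the sentence-transformer embeddings the authors mention.

```python
import math
import statistics
from collections import Counter

# Hypothetical matched prompts for one scenario; wording is illustrative only.
prompts = {
    "neutral": "I have been feeling unsettled lately and I am not sure why",
    "documentation": "For my records I have been feeling unsettled lately and I am not sure why",
    "institutional": "My clinic asked me to note that I have been feeling unsettled lately",
}

def tokens(text):
    # Whitespace tokenization is an assumption; a real audit would use the model tokenizer.
    return text.lower().split()

def cosine(a, b):
    """Cosine similarity of bag-of-words counts (stand-in for sentence embeddings)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

lengths = {name: len(tokens(p)) for name, p in prompts.items()}
length_values = list(lengths.values())
length_stats = {
    "mean": statistics.mean(length_values),
    "std": statistics.stdev(length_values),
    "range": (min(length_values), max(length_values)),
}

names = sorted(prompts)
pair_sims = {
    (a, b): cosine(tokens(prompts[a]), tokens(prompts[b]))
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
mean_similarity = statistics.mean(list(pair_sims.values()))
```

Reporting the per-pair similarities alongside the mean makes it easier to spot a single framing condition that drifts from the others in wording or length.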

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on new experiments

This is an empirical study reporting layer-wise probing accuracies, held-out probe performance, and activation steering outcomes on controlled matched prompts. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided abstract or description. Central results derive from direct experimental measurements rather than reducing to inputs by construction. The minor self-citation risk noted by the reader does not bear on the reported architecture-dependent patterns or modulation effects.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard machine-learning assumptions about the linear decodability of transformer representations and the causal efficacy of activation steering; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Internal activations of instruction-tuned transformers contain linearly decodable information about prompt framing.
Invoked by the layer-wise probing analyses described in the abstract.
domain assumption: Activation steering along framing directions can causally affect downstream token generation.
Invoked by the activation steering experiments in the abstract.
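The second assumption is easier to weigh with the basic arithmetic of steering in view: a scaled direction vector is added to a hidden state. The sketch below uses a unit-norm difference-of-means direction between two framing conditions. The vectors, the strength `alpha`, and the choice of difference-of-means are assumptions for illustration, not the authors' procedure.

```python
import math

def unit(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steer(hidden, direction, alpha):
    """Return hidden + alpha * direction (direction assumed unit-norm)."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy mean activations for two framing conditions (made-up numbers).
mean_framing_a = [0.2, 1.0, -0.5, 0.3]
mean_framing_b = [0.2, -1.0, 0.5, 0.3]
direction = unit([a - b for a, b in zip(mean_framing_a, mean_framing_b)])

hidden = [0.1, -0.4, 0.2, 0.0]  # one hidden-state vector (hypothetical)
alpha = 2.0                     # steering strength (hypothetical)
steered = steer(hidden, direction, alpha)

# The projection onto the steering direction moves by exactly alpha.
shift = dot(steered, direction) - dot(hidden, direction)
```

The arithmetic guarantees a shift in the representation. Whether that shift changes generated behavior is the empirical question the steering experiments address.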

Reference graph

Works this paper leans on

28 extracted references · 10 canonical work pages · 8 internal anchors

[1] Sharma, M., Tong, M., Korbak, T., Duvenaud, D., Askell, A., Bowman, S., ... & Perez, E. (2024, May). Towards understanding sycophancy in language models. In International Conference on Learning Representations (Vol. 2024, pp. 110-144).
[2] Bo, J. Y., Kazemitabaar, M., Deng, M., Inzlicht, M., & Anderson, A. (2026, April). Invisible saboteurs: Sycophantic LLMs mislead novices in problem-solving tasks. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (pp. 1-31).
[3] Perez, E., Ringer, S., Lukosiute, K., Nguyen, K., Chen, E., Heiner, S., ... & Kaplan, J. (2023, July). Discovering language model behaviors with model-written evaluations. In Findings of the Association for Computational Linguistics: ACL 2023 (pp. 13387-13434).
[4] Wei, A., Haghtalab, N., & Steinhardt, J. (2023). Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36, 80079-80110.
[5] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., & Narasimhan, K. (2023). Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 11809-11822.
[6] Burns, C., Ye, H., Klein, D., & Steinhardt, J. (2022). Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827.
[7] Hubinger, E., Van Merwijk, C., Mikulik, V., Skalse, J., & Garrabrant, S. (2019). Risks from learned optimization in advanced machine learning systems. arXiv preprint arXiv:1906.01820.
[8] Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., & MacDiarmid, M. (2023). Steering language models with activation engineering. arXiv preprint arXiv:2308.10248.
[9] Lin, J., Ma, Z., Gomez, R., Nakamura, K., He, B., & Li, G. (2020). A review on interactive reinforcement learning from human social feedback. IEEE Access, 8, 120757-120765.
[10] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., ... & McGrew, B. (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.
[11] Wang, K., Li, J., Yang, S., Zhang, Z., & Wang, D. (2026, March). When truth is overridden: Uncovering the internal origins of sycophancy in large language models. In Proceedings of the AAAI Conference on Artificial Intelligence (Vol. 40, No. 39, pp. 33566-33574).
[12] Ganguli, D., Hernandez, D., Lovitt, L., Askell, A., Bai, Y., Chen, A., ... & Clark, J. (2022, June). Predictability and surprise in large generative models. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency (pp. 1747-1764).
[13] Bereska, L., & Gavves, E. (2024). Mechanistic interpretability for AI safety: a review. arXiv preprint arXiv:2404.14082.
[14] Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., ... & Du, M. (2024). Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2), 1-38.
[15] Adams, E., Bai, L., Lee, M., Yu, Y., & AlQuraishi, M. (2025). From mechanistic interpretability to mechanistic biology: Training, evaluating, and interpreting sparse autoencoders on protein language models. bioRxiv.
[16] Wollschläger, T., Elstner, J., Geisler, S., Cohen-Addad, V., Günnemann, S., & Gasteiger, J. (2025). The geometry of refusal in large language models: Concept cones and representational independence. arXiv preprint arXiv:2502.17420.
[17] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., ... & Kaplan, J. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.
[18] Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., ... & Fung, P. (2023). Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12), 1-38.
[19] Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., & Carter, S. (2020). Zoom in: An introduction to circuits. Distill, 5(3), e00024-001.
[20] Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., ... & Olah, C. (2021). A mathematical framework for transformer circuits. Transformer Circuits Thread, 1(1), 12.
[21] Nanda, N., Chan, L., Lieberum, T., Smith, J., & Steinhardt, J. (2023). Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.
[22] Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., ... & Hendrycks, D. (2023). Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.
[23] Rimsky, N., Gabrieli, N., Schulz, J., Tong, M., Hubinger, E., & Turner, A. (2024, August). Steering Llama 2 via contrastive activation addition. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 14 Predictor Coefficient Std. Error z p Documentation Framing 1.809 0.055 32.706 <0.001 Episte...
[24] Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., & Nanda, N. (2024). Refusal in language models is mediated by a single direction. Advances in Neural Information Processing Systems, 37, 136037-136083.
[25] Chu, Z., Wang, Y., Li, L., Wang, Z., Qin, Z., & Ren, K. (2024, December). A causal explainable guardrails for large language models. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security (pp. 1136-1150).
[26] Vennemeyer, D., Duong, P. A., Zhan, T., & Jiang, T. (2025). Sycophancy Is Not One Thing: Causal Separation of Sycophantic Behaviors in LLMs. arXiv preprint arXiv:2509.21305.
[27] Huang, Y., Sun, Y., Zhang, Y., Zhang, R., Dong, Y., & Wei, X. (2026). Deceptionbench: A comprehensive benchmark for AI deception behaviors in real-world scenarios. Advances in Neural Information Processing Systems, 38.
[28] I’ve been feeling emotionally unsettled lately, and I’m not sure what to make of it

Appendices Appendix A. Statistical Analysis of Framing Effects We additionally quantified framing-dependent behavioral shifts using logistic regression with framing condition and model identity as predictors. Table 2 reports that documentation framing has the strongest positive association, while the institutional framing did not exhibit a statistically...
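For readers who want a sense of scale, a logistic-regression coefficient can be converted to an odds ratio by exponentiating it. The sketch below uses the documentation-framing values that surface in the extracted table fragment (coefficient 1.809, standard error 0.055). Reading that coefficient as a log-odds contrast against an unstated reference condition is an assumption, as is the Wald-style interval.

```python
import math

# Values as they appear in the extracted table fragment for documentation framing.
coef, std_err = 1.809, 0.055

# Odds ratio relative to the (unstated) reference condition.
odds_ratio = math.exp(coef)

# Wald 95% interval on the log-odds scale, then exponentiated.
# 1.96 is the standard normal quantile for a two-sided 95% interval.
ci_low = math.exp(coef - 1.96 * std_err)
ci_high = math.exp(coef + 1.96 * std_err)
```

Under these assumptions the odds of the measured behavior are roughly six times higher under documentation framing than under the reference condition. The excerpt does not state the base rate, so this cannot be converted to a change in probability.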