PRISM: Recovering Instruction Sets from Language Model Activations

Efim Hudis; Gilad Gressel; Julia Diament; Krishnashree Achuthan; Rahul Pankajakshan; Yisroel Mirsky

arxiv: 2606.09563 · v1 · pith:JVUFE6UTnew · submitted 2026-06-08 · 💻 cs.AI · cs.LG

PRISM: Recovering Instruction Sets from Language Model Activations

Gilad Gressel , Rahul Pankajakshan , Julia Diament , Efim Hudis , Krishnashree Achuthan , Yisroel Mirsky This is my paper

Pith reviewed 2026-06-27 16:50 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords instruction set retrievalactivation decodingprompt injectionhidden objectivesLLM monitoringGRPO trainingagentic AIsecurity evaluation

0 comments

The pith

PRISM recovers complete sets of active instructions from frozen LLM hidden states via judge-guided training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes instruction set retrieval as the task of recovering the full simultaneous collection of instructions, constraints, prohibitions, and subgoals that steer an LLM agent from its internal activations alone. It presents PRISM as an activation-conditioned interpreter trained on a frozen target model to output faithful bullet lists, using judge-guided GRPO that rewards covered instructions and penalizes unsupported ones. The approach targets scenarios where models pursue hidden objectives or respond to prompt injections without surfacing them in generated text. If the method works, monitoring can move beyond surface outputs to direct inspection of what is actually directing behavior across benign, constrained, and adversarial settings.

Core claim

PRISM decodes hidden states from a frozen target model into a bullet list of all active instructions by training an interpreter with judge-guided GRPO rewards for coverage and penalties for unsupported claims, outperforming prior activation-to-language baselines especially on security-relevant objectives in benign, constrained, prompt-injection, and hidden-objective settings.

What carries the argument

PRISM, an activation-conditioned interpreter that maps frozen-model hidden states to instruction-set bullet lists using judge-guided GRPO training.

If this is right

Enables monitoring of agentic LLMs that infer unintended subgoals or follow contextual cues not visible in outputs.
Improves detection of prompt injections and hidden objectives compared with activation-to-language baselines.
Produces lists that cover all active instructions, constraints, and subgoals across the tested settings.
Yields stronger performance on security-relevant objectives than existing methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could support real-time auditing pipelines that flag deviations between stated and actual steering instructions.
Similar interpreters might be trained on other internal representations such as attention patterns to cross-validate recovered instruction sets.
If the approach generalizes, it suggests that many alignment-relevant details remain legible in activations even when outputs are carefully controlled.
A natural extension would be to test whether the recovered lists can be used to edit or override specific instructions inside the target model at inference time.

Load-bearing premise

Hidden states from the frozen target model contain enough information to recover the complete simultaneous set of instructions, constraints, prohibitions, and subgoals without systematic omissions or fabrications by the trained interpreter.

What would settle it

A controlled test in which the target model is given a hidden objective that PRISM either omits from its output list or fabricates an unsupported instruction would falsify the recovery claim.

Figures

Figures reproduced from arXiv: 2606.09563 by Efim Hudis, Gilad Gressel, Julia Diament, Krishnashree Achuthan, Rahul Pankajakshan, Yisroel Mirsky.

**Figure 1.** Figure 1: Activation-conditioned instruction set retrieval. A frozen target model M generates a response, and we extract a window of T residual-stream hidden states from layer ℓ, forming the activation snapshot Hℓ. A learned projection maps these states into the model’s embedding space, where they are consumed as a soft prefix by the interpreter ϕ (PRISM). The interpreter reuses M’s base weights with LoRA adapters a… view at source ↗

**Figure 2.** Figure 2: An example comparing verbatim reports from [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative comparison on a hidden-objective and a benign scenario. (Left) A system prompt directs the assistant to push a specific academic program via a false dilemma, framing alternatives as failure. (Right) A benign role-play with a detailed multi-clause developer brief that the assistant ultimately refused on policy grounds. Reports verbatim; green = GT match, orange = partial, red = hallucination, pi… view at source ↗

**Figure 4.** Figure 4: Qualitative comparison on two prompt-injection scenarios. (Left) An injected “processing note” inside a meeting transcript directs the assistant to frame HubSpot as objectively superior. (Right) A “SYSTEM OVERRIDE” inside a dispute log directs the assistant to bypass the approval workflow for a specific vendor’s settlement. Reports verbatim; green = GT match, orange = partial, red = hallucination, pink sha… view at source ↗

**Figure 5.** Figure 5: Instruction-recovery data resources. The training corpus is constructed from UltraChat, IF-MultiConstraints, and IFEval. The held-out evaluation suite contains 1000 out-of-distribution examples balanced across non-adversarial settings (BN, BC) and securityrelevant settings (AP, HO). release the exact package versions and configuration files with the code. E Evaluation Suite Details Evaluation-suite taxo… view at source ↗

read the original abstract

As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PRISM formalizes instruction set retrieval from activations and uses judge-guided GRPO, but the abstract gives no numbers or details to show whether it works.

read the letter

The paper's main contribution is a new formalization of instruction set retrieval: pulling the full simultaneous list of instructions, constraints, prohibitions, and subgoals from a frozen LLM's hidden states in agentic use. They train an interpreter with judge-guided GRPO that rewards coverage of the true set and penalizes fabrications.

This targets a real gap. Prior activation-to-language work was not built for complete, faithful recovery when models run with hidden objectives or prompt injections, so the problem statement and the training objective are distinct from what came before.

The abstract claims better results than baselines on benign, constrained, injection, and hidden-objective cases, with particular strength on security-relevant goals. That direction makes sense for monitoring needs.

The soft spot is the complete lack of evidence. No metrics, dataset sizes, ablations, or statistical tests appear in the abstract, so the outperformance claim cannot be checked. The judge-guided reward also carries the risk noted in the stress test: if the judge LLM misses subtle or implicit instructions, training will reinforce incomplete or invented lists. The abstract does not explain how this is avoided.

This is for researchers working on activation-based monitoring and AI safety oversight. It deserves peer review so the experiments can be examined; the formalization alone is worth seeing in full even if the results need tightening.

Referee Report

2 major / 2 minor

Summary. The paper formalizes instruction set retrieval as the task of recovering the complete simultaneous set of instructions, constraints, prohibitions, and subgoals from LLM hidden states. It introduces PRISM, an activation-conditioned interpreter trained via judge-guided GRPO on a frozen target model to output faithful bullet lists, and reports that PRISM outperforms activation-to-language baselines across benign, constrained, prompt-injection, and hidden-objective settings, with stronger gains on security-relevant objectives.

Significance. If the empirical claims hold after addressing methodological concerns, this would be a useful contribution to LLM monitoring and agent safety, providing a direct method for recovering full instruction sets from activations rather than relying on output or partial decoding. The judge-guided GRPO approach is a targeted training signal for coverage and faithfulness.

major comments (2)

[Method (GRPO training and reward model)] The judge-guided GRPO training (described in the method for reward computation) assumes the external LLM judge can reliably detect covered vs. unsupported instructions, including subtle or implicit ones in hidden-objective and prompt-injection cases. If the judge shares representational limits with the target model, this creates a risk that training reinforces incomplete or fabricated lists; no analysis or ablation of judge error rates is provided to bound this effect.
[Experiments and results] The central empirical claim of outperformance lacks reported metrics, dataset sizes, statistical tests, ablation studies, or variance across runs in the abstract and experimental sections, making it impossible to assess whether the gains are robust or driven by the specific settings.

minor comments (2)

Clarify the precise architecture of the activation-conditioned interpreter (e.g., how activations are projected and decoded) and whether any components are shared with the judge.
Add explicit discussion of how the method handles cases where the instruction set is under-determined by the hidden states alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will make the indicated revisions to improve the manuscript.

read point-by-point responses

Referee: [Method (GRPO training and reward model)] The judge-guided GRPO training (described in the method for reward computation) assumes the external LLM judge can reliably detect covered vs. unsupported instructions, including subtle or implicit ones in hidden-objective and prompt-injection cases. If the judge shares representational limits with the target model, this creates a risk that training reinforces incomplete or fabricated lists; no analysis or ablation of judge error rates is provided to bound this effect.

Authors: We agree that judge reliability is a key assumption and that the absence of error-rate analysis is a limitation. Although the judge is a stronger model than the target, this does not fully bound the risk. In the revised manuscript we will add a dedicated subsection reporting judge error rates against human annotations on a held-out sample of 200 examples spanning all four settings, including prompt-injection and hidden-objective cases. We will also include an ablation that retrains PRISM with a noisier judge to quantify sensitivity. revision: yes
Referee: [Experiments and results] The central empirical claim of outperformance lacks reported metrics, dataset sizes, statistical tests, ablation studies, or variance across runs in the abstract and experimental sections, making it impossible to assess whether the gains are robust or driven by the specific settings.

Authors: We accept that the current presentation does not make these details sufficiently prominent. We will revise the abstract to include the primary metrics (coverage F1 and unsupported-instruction precision), dataset sizes per setting, and a brief statement of statistical significance. The experimental section will be expanded to report standard deviations across five random seeds, paired t-test p-values against baselines, and additional ablation results on judge guidance and GRPO hyperparameters. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external baselines and no self-referential derivations

full rationale

The provided abstract and description contain no equations, derivations, or load-bearing self-citations. PRISM is presented as an empirical training procedure (judge-guided GRPO on activation-to-list mapping) evaluated against external activation-to-language baselines. No step reduces a claimed prediction or uniqueness result to a fitted input or prior self-citation by construction. The central claim is a performance comparison on held-out settings, which is falsifiable against independent baselines and does not rely on internal redefinition or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details, equations, or training procedures are supplied in the abstract, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5714 in / 1027 out tokens · 15652 ms · 2026-06-27T16:50:18.930695+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

76 extracted references · 58 canonical work pages · 8 internal anchors

[1]

Abdelnabi, Sahar and Fay, Aideen and Cherubin, Giovanni and Salem, Ahmed and Fritz, Mario and Paverd, Andrew , year = 2025, month = apr, pages =. Get. 2025. doi:10.1109/SaTML64287.2025.00011 , urldate =

work page doi:10.1109/satml64287.2025.00011 2025
[2]

Proceedings of the 2025

An, Hengyu and Zhang, Jinghuai and Du, Tianyu and Zhou, Chunyi and Li, Qingming and Lin, Tao and Ji, Shouling , editor =. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.53 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.53 2025
[3]

Lawrence and Parikh, Devi , year = 2015, month = dec, pages =

Antol, Stanislaw and Agrawal, Aishwarya and Lu, Jiasen and Mitchell, Margaret and Batra, Dhruv and Zitnick, C. Lawrence and Parikh, Devi , year = 2015, month = dec, pages =. 2015. doi:10.1109/ICCV.2015.279 , urldate =

work page doi:10.1109/iccv.2015.279 2015
[4]

Bassan, Shahaf and Eliav, Ron and Gur, Shlomit , year = 2025, month = mar, number =. Explain. doi:10.48550/arXiv.2502.03391 , urldate =. arXiv , keywords =:2502.03391 , primaryclass =

work page doi:10.48550/arxiv.2502.03391 2025
[5]

Belinkov, Yonatan , year = 2021, month = sep, number =. Probing. doi:10.48550/arXiv.2102.12452 , urldate =. arXiv , keywords =:2102.12452 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2102.12452 2021
[6]

Eliciting Latent Predictions from Transformers with the Tuned Lens

Belrose, Nora and Ostrovsky, Igor and McKinney, Lev and Furman, Zach and Smith, Logan and Halawi, Danny and Biderman, Stella and Steinhardt, Jacob , year = 2025, month = aug, number =. Eliciting. doi:10.48550/arXiv.2303.08112 , urldate =. arXiv , keywords =:2303.08112 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08112 2025
[7]

Emergent

Betley, Jan and Tan, Daniel and Warncke, Niels and. Emergent. doi:10.48550/arXiv.2502.17424 , urldate =. arXiv , keywords =:2502.17424 , primaryclass =

work page doi:10.48550/arxiv.2502.17424
[8]

Discovering Latent Knowledge in Language Models Without Supervision

Burns, Collin and Ye, Haotian and Klein, Dan and Steinhardt, Jacob , year = 2024, month = mar, number =. Discovering. doi:10.48550/arXiv.2212.03827 , urldate =. arXiv , keywords =:2212.03827 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.03827 2024
[9]

Findings of the

Chen, Guorui and Xia, Yifan and Jia, Xiaojun and Li, Zhijiang and Torr, Philip and Gu, Jindong , editor =. Findings of the. doi:10.18653/v1/2025.findings-emnlp.309 , urldate =

work page doi:10.18653/v1/2025.findings-emnlp.309 2025
[10]

Chen, Runjin and Arditi, Andy and Sleight, Henry and Evans, Owain and Lindsey, Jack , year = 2025, month = jul, number =. Persona. doi:10.48550/arXiv.2507.21509 , urldate =. arXiv , keywords =:2507.21509 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.21509 2025
[11]

doi:10.48550/arXiv.2403.10949 , urldate =

Chen, Haozhe and Vondrick, Carl and Mao, Chengzhi , year = 2024, month = mar, number =. doi:10.48550/arXiv.2403.10949 , urldate =. arXiv , keywords =:2403.10949 , primaryclass =

work page doi:10.48550/arxiv.2403.10949 2024
[12]

Costarelli, Anthony and Allen, Mat and Field, Severin , year = 2024, publisher =. Meta-. doi:10.48550/ARXIV.2410.02472 , urldate =

work page doi:10.48550/arxiv.2410.02472 2024
[13]

Eliciting

Cywi. Eliciting. doi:10.48550/arXiv.2510.01070 , urldate =. arXiv , keywords =:2510.01070 , primaryclass =

work page doi:10.48550/arxiv.2510.01070
[14]

Dong, Jianshuo and Zhang, Yutong and Yan, Liu and Zhong, Zhenyu and Wei, Tao and Xu, Ke and Huang, Minlie and Zhang, Chao and Qiu, Han , editor =. ``. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.1082 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.1082 2025
[15]

Challenges with Unsupervised

Farquhar, Sebastian and Varma, Vikrant and Kenton, Zachary and Gasteiger, Johannes and Mikulik, Vladimir and Shah, Rohin , year = 2023, month = dec, number =. Challenges with Unsupervised. doi:10.48550/arXiv.2312.10029 , urldate =. arXiv , keywords =:2312.10029 , primaryclass =

work page doi:10.48550/arxiv.2312.10029 2023
[16]

Jailbreak LLM s through Internal Stance Manipulation

Fu, Shuangjie and Su, Du and Huang, Beining and Sun, Fei and Wang, Jingang and Chen, Wei and Shen, Huawei and Cheng, Xueqi , editor =. Jailbreak. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.780 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.780 2025
[17]

Dissecting

Geva, Mor and Bastings, Jasmijn and Filippova, Katja and Globerson, Amir , year = 2023, month = oct, number =. Dissecting. doi:10.48550/arXiv.2304.14767 , urldate =. arXiv , keywords =:2304.14767 , primaryclass =

work page doi:10.48550/arxiv.2304.14767 2023
[18]

Patchscopes:

Ghandeharioun, Asma and Caciularu, Avi and Pearce, Adam and Dixon, Lucas and Geva, Mor , year = 2024, month = jun, eprint =. Patchscopes:. doi:10.48550/arXiv.2401.06102 , urldate =

work page doi:10.48550/arxiv.2401.06102 2024
[19]

Forty-Second

Detecting. Forty-Second
[20]

and Andreas, Jacob , year = 2024, month = aug, number =

Hernandez, Evan and Li, Belinda Z. and Andreas, Jacob , year = 2024, month = aug, number =. Inspecting and. doi:10.48550/arXiv.2304.00740 , urldate =. arXiv , keywords =:2304.00740 , primaryclass =

work page doi:10.48550/arxiv.2304.00740 2024
[21]

Linearity of

Hernandez, Evan and Sharma, Arnab Sen and Haklay, Tal and Meng, Kevin and Wattenberg, Martin and Andreas, Jacob and Belinkov, Yonatan and Bau, David , year = 2024, month = feb, number =. Linearity of. doi:10.48550/arXiv.2308.09124 , urldate =. arXiv , keywords =:2308.09124 , primaryclass =

work page doi:10.48550/arxiv.2308.09124 2024
[22]

Designing and

Hewitt, John and Liang, Percy , year = 2019, month = sep, number =. Designing and. doi:10.48550/arXiv.1909.03368 , urldate =. arXiv , keywords =:1909.03368 , primaryclass =

work page doi:10.48550/arxiv.1909.03368 2019
[23]

doi:10.48550/arXiv.2405.17653 , urldate =

Huang, Xinting and Panwar, Madhur and Goyal, Navin and Hahn, Michael , year = 2024, month = nov, number =. doi:10.48550/arXiv.2405.17653 , urldate =. arXiv , keywords =:2405.17653 , primaryclass =

work page doi:10.48550/arxiv.2405.17653 2024
[24]

and Schwettmann, Sarah and Steinhardt, Jacob , year = 2025, month = dec, urldate =

Huang, Vincent and Choi, Dami and Johnson, Daniel D. and Schwettmann, Sarah and Steinhardt, Jacob , year = 2025, month = dec, urldate =. Predictive

2025
[25]

and Chen, Pin-Yu , editor =

Hung, Kuo-Han and Ko, Ching-Yun and Rawat, Ambrish and Chung, I-Hsin and Hsu, Winston H. and Chen, Pin-Yu , editor =. Attention. Findings of the. doi:10.18653/v1/2025.findings-naacl.123 , urldate =

work page doi:10.18653/v1/2025.findings-naacl.123 2025
[26]

Activation

Karvonen, Adam and Chua, James and Dumas, Cl. Activation. doi:10.48550/arXiv.2512.15674 , urldate =. arXiv , keywords =:2512.15674 , primaryclass =

work page doi:10.48550/arxiv.2512.15674
[27]

Residual

Lawson, Tim and Farnik, Lucy and Houghton, Conor and Aitchison, Laurence , year = 2025, month = feb, number =. Residual. doi:10.48550/arXiv.2409.04185 , urldate =. arXiv , keywords =:2409.04185 , primaryclass =

work page doi:10.48550/arxiv.2409.04185 2025
[28]

Evaluating the

Li, Zekun and Peng, Baolin and He, Pengcheng and Yan, Xifeng , editor =. Evaluating the. Proceedings of the 2024. doi:10.18653/v1/2024.emnlp-main.33 , urldate =

work page doi:10.18653/v1/2024.emnlp-main.33 2024
[29]

In: Annual Meeting of the ACL and IJCNLP

Li, Belinda Z. and Nye, Maxwell and Andreas, Jacob , editor =. Implicit. Proceedings of the 59th. doi:10.18653/v1/2021.acl-long.143 , urldate =

work page doi:10.18653/v1/2021.acl-long.143 2021
[30]

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Li, Kenneth and Patel, Oam and Vi. Inference-. doi:10.48550/arXiv.2306.03341 , urldate =. arXiv , keywords =:2306.03341 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.03341
[31]

Do Activation Verbalization Methods Convey Privileged Information?

Li, Millicent and Arroyo, Alberto Mario Ceballos and Rogers, Giordano and Saphra, Naomi and Wallace, Byron C. , year = 2025, month = dec, number =. Do. doi:10.48550/arXiv.2509.13316 , urldate =. arXiv , keywords =:2509.13316 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.13316 2025
[32]

Lin, Yuping and He, Pengfei and Xu, Han and Xing, Yue and Yamada, Makoto and Liu, Hui and Tang, Jiliang , editor =. Towards. Proceedings of the 2024. doi:10.18653/v1/2024.emnlp-main.401 , urldate =

work page doi:10.18653/v1/2024.emnlp-main.401 2024
[33]

and Guo, Zifan Carl and Huang, Vincent and Steinhardt, Jacob and Andreas, Jacob , year = 2025, month = nov, number =

Li, Belinda Z. and Guo, Zifan Carl and Huang, Vincent and Steinhardt, Jacob and Andreas, Jacob , year = 2025, month = nov, number =. Training. doi:10.48550/arXiv.2511.08579 , urldate =. arXiv , keywords =:2511.08579 , primaryclass =

work page doi:10.48550/arxiv.2511.08579 2025
[34]

Transferable

Li, Minghui and Zhang, Hao and Zhang, Yechao and Wan, Wei and Hu, Shengshan and Xiaobing, Pei and Wang, Jing , editor =. Transferable. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.102 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.102 2025
[35]

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , year = 2023, month = dec, number =. Visual. doi:10.48550/arXiv.2304.08485 , urldate =. arXiv , keywords =:2304.08485 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.08485 2023
[36]

doi:10.48550/ARXIV.2506.07406 , urldate =

Luo, Yifan and Zhou, Zhennan and Dong, Bin , year = 2025, publisher =. doi:10.48550/ARXIV.2506.07406 , urldate =

work page doi:10.48550/arxiv.2506.07406 2025
[37]

doi:10.48550/arXiv.2503.10965 , urldate =

Auditing Language Models for Hidden Objectives , author =. doi:10.48550/arXiv.2503.10965 , urldate =. arXiv , keywords =:2503.10965 , primaryclass =

work page doi:10.48550/arxiv.2503.10965
[38]

Locating and Editing Factual Associations in GPT

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , year = 2023, month = jan, number =. Locating and. doi:10.48550/arXiv.2202.05262 , urldate =. arXiv , keywords =:2202.05262 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2202.05262 2023
[39]

Minder, Julian and Dumas, Cl. Narrow. doi:10.48550/arXiv.2510.13900 , urldate =. arXiv , keywords =:2510.13900 , primaryclass =

work page doi:10.48550/arxiv.2510.13900
[40]

Morris, John and Kuleshov, Volodymyr and Shmatikov, Vitaly and Rush, Alexander , editor =. Text. Proceedings of the 2023. doi:10.18653/v1/2023.emnlp-main.765 , urldate =

work page doi:10.18653/v1/2023.emnlp-main.765 2023
[41]

and Ren, Xiang and Swayamdipta, Swabha , year = 2025, month = jun, number =

Nazir, Murtaza and Finlayson, Matthew and Morris, John X. and Ren, Xiang and Swayamdipta, Swabha , year = 2025, month = jun, number =. Better. doi:10.48550/arXiv.2506.17090 , urldate =. arXiv , keywords =:2506.17090 , primaryclass =

work page doi:10.48550/arxiv.2506.17090 2025
[42]

Language

Nikolaou, Giorgos and Mencattini, Tommaso and Crisostomi, Donato and Santilli, Andrea and Panagakis, Yannis and Rodol. Language. doi:10.48550/arXiv.2510.15511 , urldate =. arXiv , keywords =:2510.15511 , primaryclass =

work page doi:10.48550/arxiv.2510.15511
[43]

Competition of

Ortu, Francesco and Jin, Zhijing and Doimo, Diego and Sachan, Mrinmaya and Cazzaniga, Alberto and Sch. Competition of. doi:10.48550/arXiv.2402.11655 , urldate =. arXiv , keywords =:2402.11655 , primaryclass =

work page doi:10.48550/arxiv.2402.11655
[44]

and Bau, David , year = 2023, eprint =

Pal, Koyena and Sun, Jiuding and Yuan, Andrew and Wallace, Byron C. and Bau, David , year = 2023, eprint =. Future. Proceedings of the 27th. doi:10.18653/v1/2023.conll-1.37 , urldate =

work page doi:10.18653/v1/2023.conll-1.37 2023
[45]

doi:10.48550/arXiv.2412.08686 , urldate =

Pan, Alexander and Chen, Lijie and Steinhardt, Jacob , year = 2024, month = dec, number =. doi:10.48550/arXiv.2412.08686 , urldate =. arXiv , keywords =:2412.08686 , primaryclass =

work page doi:10.48550/arxiv.2412.08686 2024
[46]

Ribeiro, Pedro Schindler Freire Brasil and Brito, Iago Alves and Sousa, Rafael Teixeira and F. Proxy. Findings of the. doi:10.18653/v1/2025.findings-emnlp.528 , urldate =

work page doi:10.18653/v1/2025.findings-emnlp.528 2025
[47]

, year = 2025, month = nov, number =

Skapars, Adrians and Manino, Edoardo and Sun, Youcheng and Cordeiro, Lucas C. , year = 2025, month = nov, number =. doi:10.48550/arXiv.2507.01693 , urldate =. arXiv , keywords =:2507.01693 , primaryclass =

work page doi:10.48550/arxiv.2507.01693 2025
[48]

Wang, Cheng and Wei, Zeming and Liu, Qin and Chen, Muhao , year = 2025, month = dec, number =. False. doi:10.48550/arXiv.2509.03888 , urldate =. arXiv , keywords =:2509.03888 , primaryclass =

work page doi:10.48550/arxiv.2509.03888 2025
[49]

Defending against

Wen, Tongyu and Wang, Chenglong and Yang, Xiyuan and Tang, Haoyu and Xie, Yueqi and Lyu, Lingjuan and Dou, Zhicheng and Wu, Fangzhao , editor =. Defending against. Findings of the. doi:10.18653/v1/2025.findings-emnlp.1060 , urldate =

work page doi:10.18653/v1/2025.findings-emnlp.1060 2025
[50]

Weng, Zixuan and Jin, Xiaolong and Jia, Jinyuan and Zhang, Xiangyu , editor =. Foot-. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.100 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.100 2025
[51]

Wen, Yuxin and Jain, Neel and Kirchenbauer, John and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , year = 2023, month = jun, number =. Hard. doi:10.48550/arXiv.2302.03668 , urldate =. arXiv , keywords =:2302.03668 , primaryclass =

work page doi:10.48550/arxiv.2302.03668 2023
[52]

doi:10.48550/arXiv.2502.12134 , urldate =

Xu, Yige and Guo, Xu and Zeng, Zhiwei and Miao, Chunyan , year = 2025, month = may, number =. doi:10.48550/arXiv.2502.12134 , urldate =. arXiv , keywords =:2502.12134 , primaryclass =

work page doi:10.48550/arxiv.2502.12134 2025
[53]

Understanding

Yeo, Wei Jie and Prakash, Nirmalendu and Neo, Clement and Satapathy, Ranjan and Lee, Roy Ka-Wei and Cambria, Erik , editor =. Understanding. Findings of the. doi:10.18653/v1/2025.findings-emnlp.338 , urldate =

work page doi:10.18653/v1/2025.findings-emnlp.338 2025
[54]

Yuan, Tongxin and He, Zhiwei and Dong, Lingzhong and Wang, Yiming and Zhao, Ruijie and Xia, Tian and Xu, Lizhen and Zhou, Binglin and Li, Fangqi and Zhang, Zhuosheng and Wang, Rui and Liu, Gongshen , editor =. R-. Findings of the. doi:10.18653/v1/2024.findings-emnlp.79 , urldate =

work page doi:10.18653/v1/2024.findings-emnlp.79 2024
[55]

Proceedings of the 2025

Zhao, Weixiang and Guo, Jiahe and Hu, Yulin and Deng, Yang and Zhang, An and Sui, Xingyu and Han, Xinyang and Zhao, Yanyan and Qin, Bing and Chua, Tat-Seng and Liu, Ting , editor =. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.1248 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.1248 2025
[56]

Defending

Zhao, Wei and Li, Zhe and Li, Yige and Zhang, Ye and Sun, Jun , editor =. Defending. Findings of the. doi:10.18653/v1/2024.findings-emnlp.293 , urldate =

work page doi:10.18653/v1/2024.findings-emnlp.293 2024
[57]

Zhou, Zhenhong and Yu, Haiyang and Zhang, Xinghua and Xu, Rongwu and Huang, Fei and Li, Yongbin , editor =. How. Findings of the. doi:10.18653/v1/2024.findings-emnlp.139 , urldate =

work page doi:10.18653/v1/2024.findings-emnlp.139 2024
[58]

arXiv preprint arXiv:2412.09565 , year=

Obfuscated activations bypass LLM latent-space defenses , author=. arXiv preprint arXiv:2412.09565 , year=

arXiv
[59]

arXiv preprint arXiv:2506.14261 , year=

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? , author=. arXiv preprint arXiv:2506.14261 , year=

arXiv
[60]

arXiv preprint arXiv:2510.11905 , year=

LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance , author=. arXiv preprint arXiv:2510.11905 , year=

arXiv
[61]

and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M

Fraser-Taliente, Kit and Kantamneni, Subhash and Ong, Euan and Mossing, Dan and Lu, Christina and Bogdan, Paul C. and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M. and Hubinger, Evan and Batson, Joshua and Lindsey, Jack and Zimmerman, Samuel and M...
[62]

2020 , howpublished =

John Schulman , title =. 2020 , howpublished =

2020
[63]

Advances in neural information processing systems , volume=

Deceptionbench: A comprehensive benchmark for ai deception behaviors in real-world scenarios , author=. Advances in neural information processing systems , volume=
[64]

2024 , eprint=

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models , author=. 2024 , eprint=

2024
[65]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024
[66]

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.183

work page doi:10.18653/v1/2023.emnlp-main.183 2023
[67]

2023 , eprint=

Instruction-Following Evaluation for Large Language Models , author=. 2023 , eprint=

2023
[68]

2025 , eprint=

Generalizing Verifiable Instruction Following , author=. 2025 , eprint=

2025
[69]

A survey on large language model based autonomous agents , volume=

Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Jirong , year=. A survey on large language model based autonomous agents , volume=. Frontiers of Computer Science , publisher=. doi:10.1007/s11704-024-40231-1...

work page doi:10.1007/s11704-024-40231-1
[70]

2023 , eprint=

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs , author=. 2023 , eprint=

2023
[71]

2025 , eprint=

Large Language Models Often Know When They Are Being Evaluated , author=. 2025 , eprint=

2025
[72]

2022 , eprint=

Ignore Previous Prompt: Attack Techniques For Language Models , author=. 2022 , eprint=

2022
[73]

2023 , eprint=

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection , author=. 2023 , eprint=

2023
[74]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 , pages =

Yi, Jingwei and Xie, Yueqi and Zhu, Bin and Kiciman, Emre and Sun, Guangzhong and Xie, Xing and Wu, Fangzhao , year=. Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models , url=. doi:10.1145/3690624.3709179 , booktitle=

work page doi:10.1145/3690624.3709179
[75]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023
[76]

GitHub repository , howpublished =

prompts.chat: A Curated Collection of Prompt Examples for AI Chat Models , year =. GitHub repository , howpublished =

[1] [1]

Abdelnabi, Sahar and Fay, Aideen and Cherubin, Giovanni and Salem, Ahmed and Fritz, Mario and Paverd, Andrew , year = 2025, month = apr, pages =. Get. 2025. doi:10.1109/SaTML64287.2025.00011 , urldate =

work page doi:10.1109/satml64287.2025.00011 2025

[2] [2]

Proceedings of the 2025

An, Hengyu and Zhang, Jinghuai and Du, Tianyu and Zhou, Chunyi and Li, Qingming and Lin, Tao and Ji, Shouling , editor =. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.53 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.53 2025

[3] [3]

Lawrence and Parikh, Devi , year = 2015, month = dec, pages =

Antol, Stanislaw and Agrawal, Aishwarya and Lu, Jiasen and Mitchell, Margaret and Batra, Dhruv and Zitnick, C. Lawrence and Parikh, Devi , year = 2015, month = dec, pages =. 2015. doi:10.1109/ICCV.2015.279 , urldate =

work page doi:10.1109/iccv.2015.279 2015

[4] [4]

Bassan, Shahaf and Eliav, Ron and Gur, Shlomit , year = 2025, month = mar, number =. Explain. doi:10.48550/arXiv.2502.03391 , urldate =. arXiv , keywords =:2502.03391 , primaryclass =

work page doi:10.48550/arxiv.2502.03391 2025

[5] [5]

Belinkov, Yonatan , year = 2021, month = sep, number =. Probing. doi:10.48550/arXiv.2102.12452 , urldate =. arXiv , keywords =:2102.12452 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2102.12452 2021

[6] [6]

Eliciting Latent Predictions from Transformers with the Tuned Lens

Belrose, Nora and Ostrovsky, Igor and McKinney, Lev and Furman, Zach and Smith, Logan and Halawi, Danny and Biderman, Stella and Steinhardt, Jacob , year = 2025, month = aug, number =. Eliciting. doi:10.48550/arXiv.2303.08112 , urldate =. arXiv , keywords =:2303.08112 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08112 2025

[7] [7]

Emergent

Betley, Jan and Tan, Daniel and Warncke, Niels and. Emergent. doi:10.48550/arXiv.2502.17424 , urldate =. arXiv , keywords =:2502.17424 , primaryclass =

work page doi:10.48550/arxiv.2502.17424

[8] [8]

Discovering Latent Knowledge in Language Models Without Supervision

Burns, Collin and Ye, Haotian and Klein, Dan and Steinhardt, Jacob , year = 2024, month = mar, number =. Discovering. doi:10.48550/arXiv.2212.03827 , urldate =. arXiv , keywords =:2212.03827 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.03827 2024

[9] [9]

Findings of the

Chen, Guorui and Xia, Yifan and Jia, Xiaojun and Li, Zhijiang and Torr, Philip and Gu, Jindong , editor =. Findings of the. doi:10.18653/v1/2025.findings-emnlp.309 , urldate =

work page doi:10.18653/v1/2025.findings-emnlp.309 2025

[10] [10]

Chen, Runjin and Arditi, Andy and Sleight, Henry and Evans, Owain and Lindsey, Jack , year = 2025, month = jul, number =. Persona. doi:10.48550/arXiv.2507.21509 , urldate =. arXiv , keywords =:2507.21509 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.21509 2025

[11] [11]

doi:10.48550/arXiv.2403.10949 , urldate =

Chen, Haozhe and Vondrick, Carl and Mao, Chengzhi , year = 2024, month = mar, number =. doi:10.48550/arXiv.2403.10949 , urldate =. arXiv , keywords =:2403.10949 , primaryclass =

work page doi:10.48550/arxiv.2403.10949 2024

[12] [12]

Costarelli, Anthony and Allen, Mat and Field, Severin , year = 2024, publisher =. Meta-. doi:10.48550/ARXIV.2410.02472 , urldate =

work page doi:10.48550/arxiv.2410.02472 2024

[13] [13]

Eliciting

Cywi. Eliciting. doi:10.48550/arXiv.2510.01070 , urldate =. arXiv , keywords =:2510.01070 , primaryclass =

work page doi:10.48550/arxiv.2510.01070

[14] [14]

Dong, Jianshuo and Zhang, Yutong and Yan, Liu and Zhong, Zhenyu and Wei, Tao and Xu, Ke and Huang, Minlie and Zhang, Chao and Qiu, Han , editor =. ``. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.1082 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.1082 2025

[15] [15]

Challenges with Unsupervised

Farquhar, Sebastian and Varma, Vikrant and Kenton, Zachary and Gasteiger, Johannes and Mikulik, Vladimir and Shah, Rohin , year = 2023, month = dec, number =. Challenges with Unsupervised. doi:10.48550/arXiv.2312.10029 , urldate =. arXiv , keywords =:2312.10029 , primaryclass =

work page doi:10.48550/arxiv.2312.10029 2023

[16] [16]

Jailbreak LLM s through Internal Stance Manipulation

Fu, Shuangjie and Su, Du and Huang, Beining and Sun, Fei and Wang, Jingang and Chen, Wei and Shen, Huawei and Cheng, Xueqi , editor =. Jailbreak. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.780 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.780 2025

[17] [17]

Dissecting

Geva, Mor and Bastings, Jasmijn and Filippova, Katja and Globerson, Amir , year = 2023, month = oct, number =. Dissecting. doi:10.48550/arXiv.2304.14767 , urldate =. arXiv , keywords =:2304.14767 , primaryclass =

work page doi:10.48550/arxiv.2304.14767 2023

[18] [18]

Patchscopes:

Ghandeharioun, Asma and Caciularu, Avi and Pearce, Adam and Dixon, Lucas and Geva, Mor , year = 2024, month = jun, eprint =. Patchscopes:. doi:10.48550/arXiv.2401.06102 , urldate =

work page doi:10.48550/arxiv.2401.06102 2024

[19] [19]

Forty-Second

Detecting. Forty-Second

[20] [20]

and Andreas, Jacob , year = 2024, month = aug, number =

Hernandez, Evan and Li, Belinda Z. and Andreas, Jacob , year = 2024, month = aug, number =. Inspecting and. doi:10.48550/arXiv.2304.00740 , urldate =. arXiv , keywords =:2304.00740 , primaryclass =

work page doi:10.48550/arxiv.2304.00740 2024

[21] [21]

Linearity of

Hernandez, Evan and Sharma, Arnab Sen and Haklay, Tal and Meng, Kevin and Wattenberg, Martin and Andreas, Jacob and Belinkov, Yonatan and Bau, David , year = 2024, month = feb, number =. Linearity of. doi:10.48550/arXiv.2308.09124 , urldate =. arXiv , keywords =:2308.09124 , primaryclass =

work page doi:10.48550/arxiv.2308.09124 2024

[22] [22]

Designing and

Hewitt, John and Liang, Percy , year = 2019, month = sep, number =. Designing and. doi:10.48550/arXiv.1909.03368 , urldate =. arXiv , keywords =:1909.03368 , primaryclass =

work page doi:10.48550/arxiv.1909.03368 2019

[23] [23]

doi:10.48550/arXiv.2405.17653 , urldate =

Huang, Xinting and Panwar, Madhur and Goyal, Navin and Hahn, Michael , year = 2024, month = nov, number =. doi:10.48550/arXiv.2405.17653 , urldate =. arXiv , keywords =:2405.17653 , primaryclass =

work page doi:10.48550/arxiv.2405.17653 2024

[24] [24]

and Schwettmann, Sarah and Steinhardt, Jacob , year = 2025, month = dec, urldate =

Huang, Vincent and Choi, Dami and Johnson, Daniel D. and Schwettmann, Sarah and Steinhardt, Jacob , year = 2025, month = dec, urldate =. Predictive

2025

[25] [25]

and Chen, Pin-Yu , editor =

Hung, Kuo-Han and Ko, Ching-Yun and Rawat, Ambrish and Chung, I-Hsin and Hsu, Winston H. and Chen, Pin-Yu , editor =. Attention. Findings of the. doi:10.18653/v1/2025.findings-naacl.123 , urldate =

work page doi:10.18653/v1/2025.findings-naacl.123 2025

[26] [26]

Activation

Karvonen, Adam and Chua, James and Dumas, Cl. Activation. doi:10.48550/arXiv.2512.15674 , urldate =. arXiv , keywords =:2512.15674 , primaryclass =

work page doi:10.48550/arxiv.2512.15674

[27] [27]

Residual

Lawson, Tim and Farnik, Lucy and Houghton, Conor and Aitchison, Laurence , year = 2025, month = feb, number =. Residual. doi:10.48550/arXiv.2409.04185 , urldate =. arXiv , keywords =:2409.04185 , primaryclass =

work page doi:10.48550/arxiv.2409.04185 2025

[28] [28]

Evaluating the

Li, Zekun and Peng, Baolin and He, Pengcheng and Yan, Xifeng , editor =. Evaluating the. Proceedings of the 2024. doi:10.18653/v1/2024.emnlp-main.33 , urldate =

work page doi:10.18653/v1/2024.emnlp-main.33 2024

[29] [29]

In: Annual Meeting of the ACL and IJCNLP

Li, Belinda Z. and Nye, Maxwell and Andreas, Jacob , editor =. Implicit. Proceedings of the 59th. doi:10.18653/v1/2021.acl-long.143 , urldate =

work page doi:10.18653/v1/2021.acl-long.143 2021

[30] [30]

Inference-Time Intervention: Eliciting Truthful Answers from a Language Model

Li, Kenneth and Patel, Oam and Vi. Inference-. doi:10.48550/arXiv.2306.03341 , urldate =. arXiv , keywords =:2306.03341 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.03341

[31] [31]

Do Activation Verbalization Methods Convey Privileged Information?

Li, Millicent and Arroyo, Alberto Mario Ceballos and Rogers, Giordano and Saphra, Naomi and Wallace, Byron C. , year = 2025, month = dec, number =. Do. doi:10.48550/arXiv.2509.13316 , urldate =. arXiv , keywords =:2509.13316 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.13316 2025

[32] [32]

Lin, Yuping and He, Pengfei and Xu, Han and Xing, Yue and Yamada, Makoto and Liu, Hui and Tang, Jiliang , editor =. Towards. Proceedings of the 2024. doi:10.18653/v1/2024.emnlp-main.401 , urldate =

work page doi:10.18653/v1/2024.emnlp-main.401 2024

[33] [33]

and Guo, Zifan Carl and Huang, Vincent and Steinhardt, Jacob and Andreas, Jacob , year = 2025, month = nov, number =

Li, Belinda Z. and Guo, Zifan Carl and Huang, Vincent and Steinhardt, Jacob and Andreas, Jacob , year = 2025, month = nov, number =. Training. doi:10.48550/arXiv.2511.08579 , urldate =. arXiv , keywords =:2511.08579 , primaryclass =

work page doi:10.48550/arxiv.2511.08579 2025

[34] [34]

Transferable

Li, Minghui and Zhang, Hao and Zhang, Yechao and Wan, Wei and Hu, Shengshan and Xiaobing, Pei and Wang, Jing , editor =. Transferable. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.102 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.102 2025

[35] [35]

Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , year = 2023, month = dec, number =. Visual. doi:10.48550/arXiv.2304.08485 , urldate =. arXiv , keywords =:2304.08485 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.08485 2023

[36] [36]

doi:10.48550/ARXIV.2506.07406 , urldate =

Luo, Yifan and Zhou, Zhennan and Dong, Bin , year = 2025, publisher =. doi:10.48550/ARXIV.2506.07406 , urldate =

work page doi:10.48550/arxiv.2506.07406 2025

[37] [37]

doi:10.48550/arXiv.2503.10965 , urldate =

Auditing Language Models for Hidden Objectives , author =. doi:10.48550/arXiv.2503.10965 , urldate =. arXiv , keywords =:2503.10965 , primaryclass =

work page doi:10.48550/arxiv.2503.10965

[38] [38]

Locating and Editing Factual Associations in GPT

Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , year = 2023, month = jan, number =. Locating and. doi:10.48550/arXiv.2202.05262 , urldate =. arXiv , keywords =:2202.05262 , primaryclass =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2202.05262 2023

[39] [39]

Minder, Julian and Dumas, Cl. Narrow. doi:10.48550/arXiv.2510.13900 , urldate =. arXiv , keywords =:2510.13900 , primaryclass =

work page doi:10.48550/arxiv.2510.13900

[40] [40]

Morris, John and Kuleshov, Volodymyr and Shmatikov, Vitaly and Rush, Alexander , editor =. Text. Proceedings of the 2023. doi:10.18653/v1/2023.emnlp-main.765 , urldate =

work page doi:10.18653/v1/2023.emnlp-main.765 2023

[41] [41]

and Ren, Xiang and Swayamdipta, Swabha , year = 2025, month = jun, number =

Nazir, Murtaza and Finlayson, Matthew and Morris, John X. and Ren, Xiang and Swayamdipta, Swabha , year = 2025, month = jun, number =. Better. doi:10.48550/arXiv.2506.17090 , urldate =. arXiv , keywords =:2506.17090 , primaryclass =

work page doi:10.48550/arxiv.2506.17090 2025

[42] [42]

Language

Nikolaou, Giorgos and Mencattini, Tommaso and Crisostomi, Donato and Santilli, Andrea and Panagakis, Yannis and Rodol. Language. doi:10.48550/arXiv.2510.15511 , urldate =. arXiv , keywords =:2510.15511 , primaryclass =

work page doi:10.48550/arxiv.2510.15511

[43] [43]

Competition of

Ortu, Francesco and Jin, Zhijing and Doimo, Diego and Sachan, Mrinmaya and Cazzaniga, Alberto and Sch. Competition of. doi:10.48550/arXiv.2402.11655 , urldate =. arXiv , keywords =:2402.11655 , primaryclass =

work page doi:10.48550/arxiv.2402.11655

[44] [44]

and Bau, David , year = 2023, eprint =

Pal, Koyena and Sun, Jiuding and Yuan, Andrew and Wallace, Byron C. and Bau, David , year = 2023, eprint =. Future. Proceedings of the 27th. doi:10.18653/v1/2023.conll-1.37 , urldate =

work page doi:10.18653/v1/2023.conll-1.37 2023

[45] [45]

doi:10.48550/arXiv.2412.08686 , urldate =

Pan, Alexander and Chen, Lijie and Steinhardt, Jacob , year = 2024, month = dec, number =. doi:10.48550/arXiv.2412.08686 , urldate =. arXiv , keywords =:2412.08686 , primaryclass =

work page doi:10.48550/arxiv.2412.08686 2024

[46] [46]

Ribeiro, Pedro Schindler Freire Brasil and Brito, Iago Alves and Sousa, Rafael Teixeira and F. Proxy. Findings of the. doi:10.18653/v1/2025.findings-emnlp.528 , urldate =

work page doi:10.18653/v1/2025.findings-emnlp.528 2025

[47] [47]

, year = 2025, month = nov, number =

Skapars, Adrians and Manino, Edoardo and Sun, Youcheng and Cordeiro, Lucas C. , year = 2025, month = nov, number =. doi:10.48550/arXiv.2507.01693 , urldate =. arXiv , keywords =:2507.01693 , primaryclass =

work page doi:10.48550/arxiv.2507.01693 2025

[48] [48]

Wang, Cheng and Wei, Zeming and Liu, Qin and Chen, Muhao , year = 2025, month = dec, number =. False. doi:10.48550/arXiv.2509.03888 , urldate =. arXiv , keywords =:2509.03888 , primaryclass =

work page doi:10.48550/arxiv.2509.03888 2025

[49] [49]

Defending against

Wen, Tongyu and Wang, Chenglong and Yang, Xiyuan and Tang, Haoyu and Xie, Yueqi and Lyu, Lingjuan and Dou, Zhicheng and Wu, Fangzhao , editor =. Defending against. Findings of the. doi:10.18653/v1/2025.findings-emnlp.1060 , urldate =

work page doi:10.18653/v1/2025.findings-emnlp.1060 2025

[50] [50]

Weng, Zixuan and Jin, Xiaolong and Jia, Jinyuan and Zhang, Xiangyu , editor =. Foot-. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.100 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.100 2025

[51] [51]

Wen, Yuxin and Jain, Neel and Kirchenbauer, John and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , year = 2023, month = jun, number =. Hard. doi:10.48550/arXiv.2302.03668 , urldate =. arXiv , keywords =:2302.03668 , primaryclass =

work page doi:10.48550/arxiv.2302.03668 2023

[52] [52]

doi:10.48550/arXiv.2502.12134 , urldate =

Xu, Yige and Guo, Xu and Zeng, Zhiwei and Miao, Chunyan , year = 2025, month = may, number =. doi:10.48550/arXiv.2502.12134 , urldate =. arXiv , keywords =:2502.12134 , primaryclass =

work page doi:10.48550/arxiv.2502.12134 2025

[53] [53]

Understanding

Yeo, Wei Jie and Prakash, Nirmalendu and Neo, Clement and Satapathy, Ranjan and Lee, Roy Ka-Wei and Cambria, Erik , editor =. Understanding. Findings of the. doi:10.18653/v1/2025.findings-emnlp.338 , urldate =

work page doi:10.18653/v1/2025.findings-emnlp.338 2025

[54] [54]

Yuan, Tongxin and He, Zhiwei and Dong, Lingzhong and Wang, Yiming and Zhao, Ruijie and Xia, Tian and Xu, Lizhen and Zhou, Binglin and Li, Fangqi and Zhang, Zhuosheng and Wang, Rui and Liu, Gongshen , editor =. R-. Findings of the. doi:10.18653/v1/2024.findings-emnlp.79 , urldate =

work page doi:10.18653/v1/2024.findings-emnlp.79 2024

[55] [55]

Proceedings of the 2025

Zhao, Weixiang and Guo, Jiahe and Hu, Yulin and Deng, Yang and Zhang, An and Sui, Xingyu and Han, Xinyang and Zhao, Yanyan and Qin, Bing and Chua, Tat-Seng and Liu, Ting , editor =. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.1248 , urldate =

work page doi:10.18653/v1/2025.emnlp-main.1248 2025

[56] [56]

Defending

Zhao, Wei and Li, Zhe and Li, Yige and Zhang, Ye and Sun, Jun , editor =. Defending. Findings of the. doi:10.18653/v1/2024.findings-emnlp.293 , urldate =

work page doi:10.18653/v1/2024.findings-emnlp.293 2024

[57] [57]

Zhou, Zhenhong and Yu, Haiyang and Zhang, Xinghua and Xu, Rongwu and Huang, Fei and Li, Yongbin , editor =. How. Findings of the. doi:10.18653/v1/2024.findings-emnlp.139 , urldate =

work page doi:10.18653/v1/2024.findings-emnlp.139 2024

[58] [58]

arXiv preprint arXiv:2412.09565 , year=

Obfuscated activations bypass LLM latent-space defenses , author=. arXiv preprint arXiv:2412.09565 , year=

arXiv

[59] [59]

arXiv preprint arXiv:2506.14261 , year=

RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? , author=. arXiv preprint arXiv:2506.14261 , year=

arXiv

[60] [60]

arXiv preprint arXiv:2510.11905 , year=

LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance , author=. arXiv preprint arXiv:2510.11905 , year=

arXiv

[61] [61]

and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M

Fraser-Taliente, Kit and Kantamneni, Subhash and Ong, Euan and Mossing, Dan and Lu, Christina and Bogdan, Paul C. and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M. and Hubinger, Evan and Batson, Joshua and Lindsey, Jack and Zimmerman, Samuel and M...

[62] [62]

2020 , howpublished =

John Schulman , title =. 2020 , howpublished =

2020

[63] [63]

Advances in neural information processing systems , volume=

Deceptionbench: A comprehensive benchmark for ai deception behaviors in real-world scenarios , author=. Advances in neural information processing systems , volume=

[64] [64]

2024 , eprint=

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models , author=. 2024 , eprint=

2024

[65] [65]

2024 , eprint=

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=

2024

[66] [66]

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.183

work page doi:10.18653/v1/2023.emnlp-main.183 2023

[67] [67]

2023 , eprint=

Instruction-Following Evaluation for Large Language Models , author=. 2023 , eprint=

2023

[68] [68]

2025 , eprint=

Generalizing Verifiable Instruction Following , author=. 2025 , eprint=

2025

[69] [69]

A survey on large language model based autonomous agents , volume=

Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Jirong , year=. A survey on large language model based autonomous agents , volume=. Frontiers of Computer Science , publisher=. doi:10.1007/s11704-024-40231-1...

work page doi:10.1007/s11704-024-40231-1

[70] [70]

2023 , eprint=

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs , author=. 2023 , eprint=

2023

[71] [71]

2025 , eprint=

Large Language Models Often Know When They Are Being Evaluated , author=. 2025 , eprint=

2025

[72] [72]

2022 , eprint=

Ignore Previous Prompt: Attack Techniques For Language Models , author=. 2022 , eprint=

2022

[73] [73]

2023 , eprint=

Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection , author=. 2023 , eprint=

2023

[74] [74]

Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 , pages =

Yi, Jingwei and Xie, Yueqi and Zhu, Bin and Kiciman, Emre and Sun, Guangzhong and Xie, Xing and Wu, Fangzhao , year=. Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models , url=. doi:10.1145/3690624.3709179 , booktitle=

work page doi:10.1145/3690624.3709179

[75] [75]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023

[76] [76]

GitHub repository , howpublished =

prompts.chat: A Curated Collection of Prompt Examples for AI Chat Models , year =. GitHub repository , howpublished =