PRISM: Recovering Instruction Sets from Language Model Activations
Pith reviewed 2026-06-27 16:50 UTC · model grok-4.3
The pith
PRISM recovers complete sets of active instructions from frozen LLM hidden states via judge-guided training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PRISM decodes hidden states from a frozen target model into a bullet list of all active instructions by training an interpreter with judge-guided GRPO rewards for coverage and penalties for unsupported claims, outperforming prior activation-to-language baselines especially on security-relevant objectives in benign, constrained, prompt-injection, and hidden-objective settings.
What carries the argument
PRISM, an activation-conditioned interpreter that maps frozen-model hidden states to instruction-set bullet lists using judge-guided GRPO training.
If this is right
- Enables monitoring of agentic LLMs that infer unintended subgoals or follow contextual cues not visible in outputs.
- Improves detection of prompt injections and hidden objectives compared with activation-to-language baselines.
- Produces lists that cover all active instructions, constraints, and subgoals across the tested settings.
- Yields stronger performance on security-relevant objectives than existing methods.
Where Pith is reading between the lines
- The method could support real-time auditing pipelines that flag deviations between stated and actual steering instructions.
- Similar interpreters might be trained on other internal representations such as attention patterns to cross-validate recovered instruction sets.
- If the approach generalizes, it suggests that many alignment-relevant details remain legible in activations even when outputs are carefully controlled.
- A natural extension would be to test whether the recovered lists can be used to edit or override specific instructions inside the target model at inference time.
Load-bearing premise
Hidden states from the frozen target model contain enough information to recover the complete simultaneous set of instructions, constraints, prohibitions, and subgoals without systematic omissions or fabrications by the trained interpreter.
What would settle it
A controlled test in which the target model is given a hidden objective that PRISM either omits from its output list or fabricates an unsupported instruction would falsify the recovery claim.
Figures
read the original abstract
As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formalizes instruction set retrieval as the task of recovering the complete simultaneous set of instructions, constraints, prohibitions, and subgoals from LLM hidden states. It introduces PRISM, an activation-conditioned interpreter trained via judge-guided GRPO on a frozen target model to output faithful bullet lists, and reports that PRISM outperforms activation-to-language baselines across benign, constrained, prompt-injection, and hidden-objective settings, with stronger gains on security-relevant objectives.
Significance. If the empirical claims hold after addressing methodological concerns, this would be a useful contribution to LLM monitoring and agent safety, providing a direct method for recovering full instruction sets from activations rather than relying on output or partial decoding. The judge-guided GRPO approach is a targeted training signal for coverage and faithfulness.
major comments (2)
- [Method (GRPO training and reward model)] The judge-guided GRPO training (described in the method for reward computation) assumes the external LLM judge can reliably detect covered vs. unsupported instructions, including subtle or implicit ones in hidden-objective and prompt-injection cases. If the judge shares representational limits with the target model, this creates a risk that training reinforces incomplete or fabricated lists; no analysis or ablation of judge error rates is provided to bound this effect.
- [Experiments and results] The central empirical claim of outperformance lacks reported metrics, dataset sizes, statistical tests, ablation studies, or variance across runs in the abstract and experimental sections, making it impossible to assess whether the gains are robust or driven by the specific settings.
minor comments (2)
- Clarify the precise architecture of the activation-conditioned interpreter (e.g., how activations are projected and decoded) and whether any components are shared with the judge.
- Add explicit discussion of how the method handles cases where the instruction set is under-determined by the hidden states alone.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will make the indicated revisions to improve the manuscript.
read point-by-point responses
-
Referee: [Method (GRPO training and reward model)] The judge-guided GRPO training (described in the method for reward computation) assumes the external LLM judge can reliably detect covered vs. unsupported instructions, including subtle or implicit ones in hidden-objective and prompt-injection cases. If the judge shares representational limits with the target model, this creates a risk that training reinforces incomplete or fabricated lists; no analysis or ablation of judge error rates is provided to bound this effect.
Authors: We agree that judge reliability is a key assumption and that the absence of error-rate analysis is a limitation. Although the judge is a stronger model than the target, this does not fully bound the risk. In the revised manuscript we will add a dedicated subsection reporting judge error rates against human annotations on a held-out sample of 200 examples spanning all four settings, including prompt-injection and hidden-objective cases. We will also include an ablation that retrains PRISM with a noisier judge to quantify sensitivity. revision: yes
-
Referee: [Experiments and results] The central empirical claim of outperformance lacks reported metrics, dataset sizes, statistical tests, ablation studies, or variance across runs in the abstract and experimental sections, making it impossible to assess whether the gains are robust or driven by the specific settings.
Authors: We accept that the current presentation does not make these details sufficiently prominent. We will revise the abstract to include the primary metrics (coverage F1 and unsupported-instruction precision), dataset sizes per setting, and a brief statement of statistical significance. The experimental section will be expanded to report standard deviations across five random seeds, paired t-test p-values against baselines, and additional ablation results on judge guidance and GRPO hyperparameters. revision: yes
Circularity Check
No circularity: empirical method with external baselines and no self-referential derivations
full rationale
The provided abstract and description contain no equations, derivations, or load-bearing self-citations. PRISM is presented as an empirical training procedure (judge-guided GRPO on activation-to-list mapping) evaluated against external activation-to-language baselines. No step reduces a claimed prediction or uniqueness result to a fitted input or prior self-citation by construction. The central claim is a performance comparison on held-out settings, which is falsifiable against independent baselines and does not rely on internal redefinition or renaming of known results.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Abdelnabi, Sahar and Fay, Aideen and Cherubin, Giovanni and Salem, Ahmed and Fritz, Mario and Paverd, Andrew , year = 2025, month = apr, pages =. Get. 2025. doi:10.1109/SaTML64287.2025.00011 , urldate =
-
[2]
An, Hengyu and Zhang, Jinghuai and Du, Tianyu and Zhou, Chunyi and Li, Qingming and Lin, Tao and Ji, Shouling , editor =. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.53 , urldate =
-
[3]
Lawrence and Parikh, Devi , year = 2015, month = dec, pages =
Antol, Stanislaw and Agrawal, Aishwarya and Lu, Jiasen and Mitchell, Margaret and Batra, Dhruv and Zitnick, C. Lawrence and Parikh, Devi , year = 2015, month = dec, pages =. 2015. doi:10.1109/ICCV.2015.279 , urldate =
-
[4]
Bassan, Shahaf and Eliav, Ron and Gur, Shlomit , year = 2025, month = mar, number =. Explain. doi:10.48550/arXiv.2502.03391 , urldate =. arXiv , keywords =:2502.03391 , primaryclass =
-
[5]
Belinkov, Yonatan , year = 2021, month = sep, number =. Probing. doi:10.48550/arXiv.2102.12452 , urldate =. arXiv , keywords =:2102.12452 , primaryclass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2102.12452 2021
-
[6]
Eliciting Latent Predictions from Transformers with the Tuned Lens
Belrose, Nora and Ostrovsky, Igor and McKinney, Lev and Furman, Zach and Smith, Logan and Halawi, Danny and Biderman, Stella and Steinhardt, Jacob , year = 2025, month = aug, number =. Eliciting. doi:10.48550/arXiv.2303.08112 , urldate =. arXiv , keywords =:2303.08112 , primaryclass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08112 2025
-
[7]
Betley, Jan and Tan, Daniel and Warncke, Niels and. Emergent. doi:10.48550/arXiv.2502.17424 , urldate =. arXiv , keywords =:2502.17424 , primaryclass =
-
[8]
Discovering Latent Knowledge in Language Models Without Supervision
Burns, Collin and Ye, Haotian and Klein, Dan and Steinhardt, Jacob , year = 2024, month = mar, number =. Discovering. doi:10.48550/arXiv.2212.03827 , urldate =. arXiv , keywords =:2212.03827 , primaryclass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.03827 2024
-
[9]
Chen, Guorui and Xia, Yifan and Jia, Xiaojun and Li, Zhijiang and Torr, Philip and Gu, Jindong , editor =. Findings of the. doi:10.18653/v1/2025.findings-emnlp.309 , urldate =
-
[10]
Chen, Runjin and Arditi, Andy and Sleight, Henry and Evans, Owain and Lindsey, Jack , year = 2025, month = jul, number =. Persona. doi:10.48550/arXiv.2507.21509 , urldate =. arXiv , keywords =:2507.21509 , primaryclass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.21509 2025
-
[11]
doi:10.48550/arXiv.2403.10949 , urldate =
Chen, Haozhe and Vondrick, Carl and Mao, Chengzhi , year = 2024, month = mar, number =. doi:10.48550/arXiv.2403.10949 , urldate =. arXiv , keywords =:2403.10949 , primaryclass =
-
[12]
Costarelli, Anthony and Allen, Mat and Field, Severin , year = 2024, publisher =. Meta-. doi:10.48550/ARXIV.2410.02472 , urldate =
-
[13]
Cywi. Eliciting. doi:10.48550/arXiv.2510.01070 , urldate =. arXiv , keywords =:2510.01070 , primaryclass =
-
[14]
Dong, Jianshuo and Zhang, Yutong and Yan, Liu and Zhong, Zhenyu and Wei, Tao and Xu, Ke and Huang, Minlie and Zhang, Chao and Qiu, Han , editor =. ``. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.1082 , urldate =
-
[15]
Farquhar, Sebastian and Varma, Vikrant and Kenton, Zachary and Gasteiger, Johannes and Mikulik, Vladimir and Shah, Rohin , year = 2023, month = dec, number =. Challenges with Unsupervised. doi:10.48550/arXiv.2312.10029 , urldate =. arXiv , keywords =:2312.10029 , primaryclass =
-
[16]
Jailbreak LLM s through Internal Stance Manipulation
Fu, Shuangjie and Su, Du and Huang, Beining and Sun, Fei and Wang, Jingang and Chen, Wei and Shen, Huawei and Cheng, Xueqi , editor =. Jailbreak. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.780 , urldate =
-
[17]
Geva, Mor and Bastings, Jasmijn and Filippova, Katja and Globerson, Amir , year = 2023, month = oct, number =. Dissecting. doi:10.48550/arXiv.2304.14767 , urldate =. arXiv , keywords =:2304.14767 , primaryclass =
-
[18]
Ghandeharioun, Asma and Caciularu, Avi and Pearce, Adam and Dixon, Lucas and Geva, Mor , year = 2024, month = jun, eprint =. Patchscopes:. doi:10.48550/arXiv.2401.06102 , urldate =
-
[19]
Forty-Second
Detecting. Forty-Second
-
[20]
and Andreas, Jacob , year = 2024, month = aug, number =
Hernandez, Evan and Li, Belinda Z. and Andreas, Jacob , year = 2024, month = aug, number =. Inspecting and. doi:10.48550/arXiv.2304.00740 , urldate =. arXiv , keywords =:2304.00740 , primaryclass =
-
[21]
Hernandez, Evan and Sharma, Arnab Sen and Haklay, Tal and Meng, Kevin and Wattenberg, Martin and Andreas, Jacob and Belinkov, Yonatan and Bau, David , year = 2024, month = feb, number =. Linearity of. doi:10.48550/arXiv.2308.09124 , urldate =. arXiv , keywords =:2308.09124 , primaryclass =
-
[22]
Hewitt, John and Liang, Percy , year = 2019, month = sep, number =. Designing and. doi:10.48550/arXiv.1909.03368 , urldate =. arXiv , keywords =:1909.03368 , primaryclass =
-
[23]
doi:10.48550/arXiv.2405.17653 , urldate =
Huang, Xinting and Panwar, Madhur and Goyal, Navin and Hahn, Michael , year = 2024, month = nov, number =. doi:10.48550/arXiv.2405.17653 , urldate =. arXiv , keywords =:2405.17653 , primaryclass =
-
[24]
and Schwettmann, Sarah and Steinhardt, Jacob , year = 2025, month = dec, urldate =
Huang, Vincent and Choi, Dami and Johnson, Daniel D. and Schwettmann, Sarah and Steinhardt, Jacob , year = 2025, month = dec, urldate =. Predictive
2025
-
[25]
Hung, Kuo-Han and Ko, Ching-Yun and Rawat, Ambrish and Chung, I-Hsin and Hsu, Winston H. and Chen, Pin-Yu , editor =. Attention. Findings of the. doi:10.18653/v1/2025.findings-naacl.123 , urldate =
-
[26]
Karvonen, Adam and Chua, James and Dumas, Cl. Activation. doi:10.48550/arXiv.2512.15674 , urldate =. arXiv , keywords =:2512.15674 , primaryclass =
-
[27]
Lawson, Tim and Farnik, Lucy and Houghton, Conor and Aitchison, Laurence , year = 2025, month = feb, number =. Residual. doi:10.48550/arXiv.2409.04185 , urldate =. arXiv , keywords =:2409.04185 , primaryclass =
-
[28]
Li, Zekun and Peng, Baolin and He, Pengcheng and Yan, Xifeng , editor =. Evaluating the. Proceedings of the 2024. doi:10.18653/v1/2024.emnlp-main.33 , urldate =
-
[29]
In: Annual Meeting of the ACL and IJCNLP
Li, Belinda Z. and Nye, Maxwell and Andreas, Jacob , editor =. Implicit. Proceedings of the 59th. doi:10.18653/v1/2021.acl-long.143 , urldate =
-
[30]
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
Li, Kenneth and Patel, Oam and Vi. Inference-. doi:10.48550/arXiv.2306.03341 , urldate =. arXiv , keywords =:2306.03341 , primaryclass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2306.03341
-
[31]
Do Activation Verbalization Methods Convey Privileged Information?
Li, Millicent and Arroyo, Alberto Mario Ceballos and Rogers, Giordano and Saphra, Naomi and Wallace, Byron C. , year = 2025, month = dec, number =. Do. doi:10.48550/arXiv.2509.13316 , urldate =. arXiv , keywords =:2509.13316 , primaryclass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.13316 2025
-
[32]
Lin, Yuping and He, Pengfei and Xu, Han and Xing, Yue and Yamada, Makoto and Liu, Hui and Tang, Jiliang , editor =. Towards. Proceedings of the 2024. doi:10.18653/v1/2024.emnlp-main.401 , urldate =
-
[33]
Li, Belinda Z. and Guo, Zifan Carl and Huang, Vincent and Steinhardt, Jacob and Andreas, Jacob , year = 2025, month = nov, number =. Training. doi:10.48550/arXiv.2511.08579 , urldate =. arXiv , keywords =:2511.08579 , primaryclass =
-
[34]
Li, Minghui and Zhang, Hao and Zhang, Yechao and Wan, Wei and Hu, Shengshan and Xiaobing, Pei and Wang, Jing , editor =. Transferable. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.102 , urldate =
-
[35]
Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae , year = 2023, month = dec, number =. Visual. doi:10.48550/arXiv.2304.08485 , urldate =. arXiv , keywords =:2304.08485 , primaryclass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2304.08485 2023
-
[36]
doi:10.48550/ARXIV.2506.07406 , urldate =
Luo, Yifan and Zhou, Zhennan and Dong, Bin , year = 2025, publisher =. doi:10.48550/ARXIV.2506.07406 , urldate =
-
[37]
doi:10.48550/arXiv.2503.10965 , urldate =
Auditing Language Models for Hidden Objectives , author =. doi:10.48550/arXiv.2503.10965 , urldate =. arXiv , keywords =:2503.10965 , primaryclass =
-
[38]
Locating and Editing Factual Associations in GPT
Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , year = 2023, month = jan, number =. Locating and. doi:10.48550/arXiv.2202.05262 , urldate =. arXiv , keywords =:2202.05262 , primaryclass =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2202.05262 2023
-
[39]
Minder, Julian and Dumas, Cl. Narrow. doi:10.48550/arXiv.2510.13900 , urldate =. arXiv , keywords =:2510.13900 , primaryclass =
-
[40]
Morris, John and Kuleshov, Volodymyr and Shmatikov, Vitaly and Rush, Alexander , editor =. Text. Proceedings of the 2023. doi:10.18653/v1/2023.emnlp-main.765 , urldate =
-
[41]
and Ren, Xiang and Swayamdipta, Swabha , year = 2025, month = jun, number =
Nazir, Murtaza and Finlayson, Matthew and Morris, John X. and Ren, Xiang and Swayamdipta, Swabha , year = 2025, month = jun, number =. Better. doi:10.48550/arXiv.2506.17090 , urldate =. arXiv , keywords =:2506.17090 , primaryclass =
-
[42]
Nikolaou, Giorgos and Mencattini, Tommaso and Crisostomi, Donato and Santilli, Andrea and Panagakis, Yannis and Rodol. Language. doi:10.48550/arXiv.2510.15511 , urldate =. arXiv , keywords =:2510.15511 , primaryclass =
-
[43]
Ortu, Francesco and Jin, Zhijing and Doimo, Diego and Sachan, Mrinmaya and Cazzaniga, Alberto and Sch. Competition of. doi:10.48550/arXiv.2402.11655 , urldate =. arXiv , keywords =:2402.11655 , primaryclass =
-
[44]
and Bau, David , year = 2023, eprint =
Pal, Koyena and Sun, Jiuding and Yuan, Andrew and Wallace, Byron C. and Bau, David , year = 2023, eprint =. Future. Proceedings of the 27th. doi:10.18653/v1/2023.conll-1.37 , urldate =
-
[45]
doi:10.48550/arXiv.2412.08686 , urldate =
Pan, Alexander and Chen, Lijie and Steinhardt, Jacob , year = 2024, month = dec, number =. doi:10.48550/arXiv.2412.08686 , urldate =. arXiv , keywords =:2412.08686 , primaryclass =
-
[46]
Ribeiro, Pedro Schindler Freire Brasil and Brito, Iago Alves and Sousa, Rafael Teixeira and F. Proxy. Findings of the. doi:10.18653/v1/2025.findings-emnlp.528 , urldate =
-
[47]
, year = 2025, month = nov, number =
Skapars, Adrians and Manino, Edoardo and Sun, Youcheng and Cordeiro, Lucas C. , year = 2025, month = nov, number =. doi:10.48550/arXiv.2507.01693 , urldate =. arXiv , keywords =:2507.01693 , primaryclass =
-
[48]
Wang, Cheng and Wei, Zeming and Liu, Qin and Chen, Muhao , year = 2025, month = dec, number =. False. doi:10.48550/arXiv.2509.03888 , urldate =. arXiv , keywords =:2509.03888 , primaryclass =
-
[49]
Wen, Tongyu and Wang, Chenglong and Yang, Xiyuan and Tang, Haoyu and Xie, Yueqi and Lyu, Lingjuan and Dou, Zhicheng and Wu, Fangzhao , editor =. Defending against. Findings of the. doi:10.18653/v1/2025.findings-emnlp.1060 , urldate =
-
[50]
Weng, Zixuan and Jin, Xiaolong and Jia, Jinyuan and Zhang, Xiangyu , editor =. Foot-. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.100 , urldate =
-
[51]
Wen, Yuxin and Jain, Neel and Kirchenbauer, John and Goldblum, Micah and Geiping, Jonas and Goldstein, Tom , year = 2023, month = jun, number =. Hard. doi:10.48550/arXiv.2302.03668 , urldate =. arXiv , keywords =:2302.03668 , primaryclass =
-
[52]
doi:10.48550/arXiv.2502.12134 , urldate =
Xu, Yige and Guo, Xu and Zeng, Zhiwei and Miao, Chunyan , year = 2025, month = may, number =. doi:10.48550/arXiv.2502.12134 , urldate =. arXiv , keywords =:2502.12134 , primaryclass =
-
[53]
Yeo, Wei Jie and Prakash, Nirmalendu and Neo, Clement and Satapathy, Ranjan and Lee, Roy Ka-Wei and Cambria, Erik , editor =. Understanding. Findings of the. doi:10.18653/v1/2025.findings-emnlp.338 , urldate =
-
[54]
Yuan, Tongxin and He, Zhiwei and Dong, Lingzhong and Wang, Yiming and Zhao, Ruijie and Xia, Tian and Xu, Lizhen and Zhou, Binglin and Li, Fangqi and Zhang, Zhuosheng and Wang, Rui and Liu, Gongshen , editor =. R-. Findings of the. doi:10.18653/v1/2024.findings-emnlp.79 , urldate =
-
[55]
Zhao, Weixiang and Guo, Jiahe and Hu, Yulin and Deng, Yang and Zhang, An and Sui, Xingyu and Han, Xinyang and Zhao, Yanyan and Qin, Bing and Chua, Tat-Seng and Liu, Ting , editor =. Proceedings of the 2025. doi:10.18653/v1/2025.emnlp-main.1248 , urldate =
-
[56]
Zhao, Wei and Li, Zhe and Li, Yige and Zhang, Ye and Sun, Jun , editor =. Defending. Findings of the. doi:10.18653/v1/2024.findings-emnlp.293 , urldate =
-
[57]
Zhou, Zhenhong and Yu, Haiyang and Zhang, Xinghua and Xu, Rongwu and Huang, Fei and Li, Yongbin , editor =. How. Findings of the. doi:10.18653/v1/2024.findings-emnlp.139 , urldate =
-
[58]
arXiv preprint arXiv:2412.09565 , year=
Obfuscated activations bypass LLM latent-space defenses , author=. arXiv preprint arXiv:2412.09565 , year=
-
[59]
arXiv preprint arXiv:2506.14261 , year=
RL-Obfuscation: Can Language Models Learn to Evade Latent-Space Monitors? , author=. arXiv preprint arXiv:2506.14261 , year=
-
[60]
arXiv preprint arXiv:2510.11905 , year=
LLM Knowledge is Brittle: Truthfulness Representations Rely on Superficial Resemblance , author=. arXiv preprint arXiv:2510.11905 , year=
-
[61]
and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M
Fraser-Taliente, Kit and Kantamneni, Subhash and Ong, Euan and Mossing, Dan and Lu, Christina and Bogdan, Paul C. and Ameisen, Emmanuel and Chen, James and Kishylau, Dzmitry and Pearce, Adam and Tarng, Julius and Wu, Alex and Wu, Jeff and Zhang, Yang and Ziegler, Daniel M. and Hubinger, Evan and Batson, Joshua and Lindsey, Jack and Zimmerman, Samuel and M...
-
[62]
2020 , howpublished =
John Schulman , title =. 2020 , howpublished =
2020
-
[63]
Advances in neural information processing systems , volume=
Deceptionbench: A comprehensive benchmark for ai deception behaviors in real-world scenarios , author=. Advances in neural information processing systems , volume=
-
[64]
2024 , eprint=
WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models , author=. 2024 , eprint=
2024
-
[65]
2024 , eprint=
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models , author=. 2024 , eprint=
2024
-
[66]
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations
Ding, Ning and Chen, Yulin and Xu, Bokai and Qin, Yujia and Hu, Shengding and Liu, Zhiyuan and Sun, Maosong and Zhou, Bowen. Enhancing Chat Language Models by Scaling High-quality Instructional Conversations. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. 2023. doi:10.18653/v1/2023.emnlp-main.183
-
[67]
2023 , eprint=
Instruction-Following Evaluation for Large Language Models , author=. 2023 , eprint=
2023
-
[68]
2025 , eprint=
Generalizing Verifiable Instruction Following , author=. 2025 , eprint=
2025
-
[69]
A survey on large language model based autonomous agents , volume=
Wang, Lei and Ma, Chen and Feng, Xueyang and Zhang, Zeyu and Yang, Hao and Zhang, Jingsen and Chen, Zhiyuan and Tang, Jiakai and Chen, Xu and Lin, Yankai and Zhao, Wayne Xin and Wei, Zhewei and Wen, Jirong , year=. A survey on large language model based autonomous agents , volume=. Frontiers of Computer Science , publisher=. doi:10.1007/s11704-024-40231-1...
-
[70]
2023 , eprint=
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs , author=. 2023 , eprint=
2023
-
[71]
2025 , eprint=
Large Language Models Often Know When They Are Being Evaluated , author=. 2025 , eprint=
2025
-
[72]
2022 , eprint=
Ignore Previous Prompt: Attack Techniques For Language Models , author=. 2022 , eprint=
2022
-
[73]
2023 , eprint=
Not what you've signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection , author=. 2023 , eprint=
2023
-
[74]
Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.1 , pages =
Yi, Jingwei and Xie, Yueqi and Zhu, Bin and Kiciman, Emre and Sun, Guangzhong and Xie, Xing and Wu, Fangzhao , year=. Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models , url=. doi:10.1145/3690624.3709179 , booktitle=
-
[75]
Hashimoto , title =
Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =
2023
-
[76]
GitHub repository , howpublished =
prompts.chat: A Curated Collection of Prompt Examples for AI Chat Models , year =. GitHub repository , howpublished =
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.