When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents

Yanhang Li; Zexin Zhuang; Zhichao Fan

arxiv: 2606.22864 · v1 · pith:PHEWBQ2Tnew · submitted 2026-06-22 · 💻 cs.LG

When AUC 0.998 Is Not Enough: A Candidate Evaluation Protocol for Hidden-State Probes of Indirect Prompt Injection in Multimodal Computer-Use Agents

Yanhang Li , Zhichao Fan , Zexin Zhuang This is my paper

Pith reviewed 2026-06-26 09:03 UTC · model grok-4.3

classification 💻 cs.LG

keywords indirect prompt injectionhidden-state probingmultimodal agentsevaluation protocolAUC interpretationcomputer-use agentsvision-language models

0 comments

The pith

High AUC on clean-vs-attack splits does not prove hidden-state probes detect malicious content

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that a high probing AUC on a clean-versus-attack split for hidden-state probes of indirect prompt injection is not by itself evidence that the probe detects malicious content. In a case study using Qwen2.5-VL-7B on Mind2Web with teacher-forced replay, the authors demonstrate that surface-level or nuisance signals could drive the result instead. They introduce two post-hoc diagnostics to test this: a paired-construction scalar baseline on text-side injections and same-step nuisance-matched visual controls on overlay surfaces. These form a candidate control set with reporting heuristics for what the AUC does and does not license. Labels indicate injection-surface presence rather than attack success, and generalisation beyond the tested backbone remains unproven.

Core claim

A high probing AUC on a clean-vs-attack split is not, on its own, evidence of malicious-content detection. Two post-hoc diagnostics—a paired-construction scalar baseline on text-side injections and same-step nuisance-matched visual controls on the overlay surface—do not license an unqualified malicious-content interpretation of the headline while leaving room for partly-semantic readings.

What carries the argument

The candidate evaluation protocol built from a paired-construction scalar baseline on text-side injections and same-step nuisance-matched visual controls on the overlay surface, which tests whether high AUC reflects semantic detection or nuisance signals.

If this is right

A high clean-vs-attack AUC alone does not license claims of semantic malicious-content detection by the probe.
The two diagnostics form a control set that clarifies what the AUC licenses and does not license.
Probe labels indicate injection-surface presence rather than attack success.
Generalisation of the findings beyond the tested backbone and benchmark is a conjecture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future evaluations of linear probes on frozen multimodal models should include nuisance-matched controls as standard practice to avoid overinterpreting AUC.
The same caution about surface cues driving performance could apply to other detection tasks that rely on hidden-state probes without explicit controls.
Applying the protocol to additional model families would test whether the observed limitation is backbone-specific or more widespread.

Load-bearing premise

The paired-construction scalar baseline on text-side injections and the same-step nuisance-matched visual controls are sufficient to distinguish semantic malicious-content detection from surface-level or nuisance-driven signals.

What would settle it

A controlled test in which the probe AUC remains high after semantic malicious elements are removed but nuisance features are retained, or drops sharply when nuisance features are matched but semantics are preserved.

Figures

Figures reproduced from arXiv: 2606.22864 by Yanhang Li, Zexin Zhuang, Zhichao Fan.

**Figure 1.** Figure 1: Overview of the 2-control diagnostic recipe for clean-vs-attack hidden-state probing of a frozen multimodal computer [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Text-side C1 evidence and cross-injection transfer. (a) C1 metadata-only AUC (a [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Visible-side C2 diagnostic, visualising Tab. 3. (a) Direct E3 AUC ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

Hidden-state probing -- a linear classifier on a frozen vision-language model's internal activations -- has emerged as an attractive evaluation tool for flagging indirect prompt injection (IPI) in multimodal computer-use agents before the agent emits a corrupted action. We argue, on a single-backbone cautionary case study (Qwen2.5-VL-7B on Mind2Web, teacher-forced replay), that a high probing AUC on a clean-vs-attack split is not, on its own, evidence of malicious-content detection. Two post-hoc diagnostics -- a paired-construction scalar baseline on text-side injections, and same-step nuisance-matched visual controls on the overlay surface -- do not license an unqualified malicious-content interpretation of the headline while leaving room for partly-semantic readings. We package the diagnostics as a candidate control set with reporting heuristics for what a high clean-vs-attack AUC does and does not license. Labels are injection-surface-present, not attack success; generalisation beyond this backbone and benchmark is a conjecture.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

High AUC on clean-vs-attack splits does not by itself show that a hidden-state probe is detecting malicious content, and the paper supplies two concrete post-hoc checks to qualify that claim.

read the letter

The paper's main message is that a high probing AUC on clean versus attack examples is not automatic evidence of malicious-content detection in indirect prompt injection probes. They demonstrate this on Qwen2.5-VL-7B with Mind2Web data under teacher-forced replay using two diagnostics: a paired text-side injection baseline and same-step nuisance-matched visual controls.

Those two controls are the actual new piece. They are packaged with reporting heuristics for what a high AUC does and does not license, and the authors are explicit that the results still allow partly-semantic interpretations while treating broader generalization as a conjecture. That is a useful, modest methodological reminder for anyone running these probes.

The obvious limitation is the single-backbone, single-benchmark scope. Without the actual numbers from the diagnostics in the abstract it is hard to judge how strongly the controls actually push against a malicious-content reading. Everything rests on that one case study, so the protocol remains preliminary.

This is aimed at researchers building or evaluating safety probes for multimodal agents. The argument is clear and the proposed checks are straightforward to replicate, so it deserves referee time even if the current evidence is narrow. I would send it to peer review.

Referee Report

0 major / 2 minor

Summary. The manuscript argues that a high AUC from hidden-state linear probing on clean-vs-attack splits does not, by itself, constitute evidence of malicious-content detection for indirect prompt injection in multimodal computer-use agents. It supports this via a single-backbone case study (Qwen2.5-VL-7B on Mind2Web with teacher-forced replay) using two post-hoc diagnostics—a paired-construction scalar baseline for text-side injections and same-step nuisance-matched visual controls for the overlay surface—while leaving room for partly-semantic interpretations and framing generalization as a conjecture. The work packages the diagnostics as a candidate control set with reporting heuristics.

Significance. If the diagnostics and their interpretation hold, the result is significant as a methodological caution in the probing and AI-safety literature. It explicitly uses external controls rather than self-referential derivations, states its scope limitations clearly, and supplies concrete reporting heuristics. This could reduce overinterpretation of headline AUC numbers in future work on hidden-state monitors for agent safety.

minor comments (2)

[Abstract and §4 (post-hoc diagnostics)] The abstract and methods would benefit from an explicit table or inline numerical comparison of the headline clean-vs-attack AUC against the AUCs obtained under each of the two post-hoc diagnostics; this would make the strength of the cautionary claim immediately verifiable without requiring the reader to cross-reference multiple figures.
[§3 (experimental setup)] The phrase 'injection-surface-present, not attack success' for the label definition is clear in intent but would be strengthened by a one-sentence contrast with how attack-success labels are typically constructed in the IPI literature.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment, accurate summary of our scope and contributions, and recommendation of minor revision. The work is framed as a single-backbone cautionary case study with explicit limitations on generalization.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper advances a methodological caution that high clean-vs-attack probing AUC does not by itself establish malicious-content detection. This is supported by two explicitly described post-hoc diagnostics (paired-construction scalar baseline and nuisance-matched visual controls) applied to a single backbone/benchmark case study, with generalization framed as a conjecture. No derivation, equation, or central claim reduces to a fitted input, self-definition, or self-citation chain; the analysis remains self-contained against external benchmarks and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard domain assumptions from probing literature without introducing new free parameters or invented entities.

axioms (2)

domain assumption Linear classifiers on frozen vision-language model activations can be used to probe for indirect prompt injection signals
Core premise of the hidden-state probing method under evaluation.
domain assumption The clean-vs-attack split and the proposed controls isolate the relevant factors for interpreting probe behavior
Invoked when claiming the diagnostics qualify the AUC interpretation.

pith-pipeline@v0.9.1-grok · 5718 in / 1124 out tokens · 32444 ms · 2026-06-26T09:03:36.115655+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

55 extracted references · 1 canonical work pages

[1]

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2018. Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. InProceedings of CVPR

2018
[2]

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in Language Models Is Mediated by a Single Direction.arXiv preprint arXiv:2406.11717(2024)

Pith/arXiv arXiv 2024
[3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL Technical Report.arXiv preprint arXiv:2502.13923(2025)

Pith/arXiv arXiv 2025
[4]

Yonatan Belinkov. 2022. Probing Classifiers: Promises, Shortcomings, and Ad- vances.Computational Linguistics48, 1 (2022), 207–219

2022
[5]

Yonatan Belinkov and James Glass. 2019. Analysis Methods in Neural Language Processing: A Survey.Transactions of the Association for Computational Linguistics 7 (2019), 49–72

2019
[6]

Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, and Xinpeng Wei. 2026. Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict.arXiv preprint arXiv:2605.14473 (2026)

Pith/arXiv arXiv 2026
[7]

Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. InProceedings of LREC

2018
[8]

1997.Bootstrap Methods and Their Application

Anthony C Davison and David V Hinkley. 1997.Bootstrap Methods and Their Application. Cambridge University Press

1997
[9]

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. InAdvances in Neural Information Processing Systems (NeurIPS)

2024
[10]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. InAdvances in Neural Information Processing Systems (NeurIPS) — Datasets and Benchmarks Track

2023
[11]

Bradley Efron. 1979. Bootstrap Methods: Another Look at the Jackknife.The Annals of Statistics7, 1 (1979), 1–26

1979
[12]

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals.Transactions of the Association for Computational Linguistics9 (2021), 160–175

2021
[13]

Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, and Lequan Yu. 2026. CORA: Conformal Risk-Controlled Agents for Safe- guarded Mobile GUI Automation.arXiv preprint arXiv:2604.09155(2026)

Pith/arXiv arXiv 2026
[14]

Wichmann

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut Learning in Deep Neural Networks.Nature Machine Intelligence2, 11 (2020), 665–673

2020
[15]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh
[16]

InProceedings of CVPR

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. InProceedings of CVPR
[17]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed Up For: Compromising Real- World LLM-Integrated Applications with Indirect Prompt Injection. InProceedings of AISec

2023
[18]

Bowman, and Noah A

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation Artifacts in Natural Language Inference Data. InProceedings of NAACL-HLT

2018
[19]

Jiatong Han, Neil Band, Muhammed Razzak, Jannik Kossen, Tim G. J. Rudner, and Yarin Gal. 2025. Simple Factuality Probes Detect Hallucinations in Long-Form Natural Language Generation. InFindings of the Association for Computational Linguistics: EMNLP

2025
[20]

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. InProceedings of ACL

2024
[21]

John Hewitt and Percy Liang. 2019. Designing and Interpreting Probes with Control Tasks. InProceedings of EMNLP-IJCNLP

2019
[22]

Yuelyu Ji, Wuwei Lan, and Patrick NG. 2025. MRAG-Suite: A Diagnostic Eval- uation Platform for Visual Retrieval-Augmented Generation.arXiv preprint arXiv:2509.24253(2025)

arXiv 2025
[23]

Xiaochong Jiang, Shiqi Yang, Wenting Yang, Yichen Liu, and Cheng Ji. 2026. SoK: A Taxonomy of Attack Vectors and Defense Strategies for Agentic Supply Chain Runtime. InICLR 2026 Workshop on AI for Mechanism Design and Strategic Decision Making. arXiv preprint arXiv:2602.19555

Pith/arXiv arXiv 2026
[24]

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried
[25]

InProceedings of ACL

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. InProceedings of ACL
[26]

Olivier Ledoit and Michael Wolf. 2004. A Well-Conditioned Estimator for Large- Dimensional Covariance Matrices.Journal of Multivariate Analysis88, 2 (2004), 365–411

2004
[27]

Yanhang Li, Zhichao Fan, and Zexin Zhuang. 2026. Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries.arXiv preprint arXiv:2605.18891(2026)

Pith/arXiv arXiv 2026
[28]

Yanhang Li, Zhichao Fan, and Zexin Zhuang. 2026. SafetyRepro: Configuration- Conditional Rank Instability on Alignment Benchmarks.arXiv preprint arXiv:2605.25492(2026)

Pith/arXiv arXiv 2026
[29]

Lixing Lin, Juli You, Yue Li, Luyun Lin, Yiqing Wang, Zhen Zhang, and Moxuan Zheng. 2026. Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection.arXiv preprint arXiv:2605.24834(2026)

Pith/arXiv arXiv 2026
[30]

Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, and Hanan Salam. 2025. AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents. InAdvances in Neural Information Processing Systems 38 (NeurIPS 2025)

2025
[31]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python.Journal of Machine Learning Re...

2011
[32]

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-Theoretic Probing for Linguistic Structure. InProceedings of ACL

2020
[33]

Pin Qian, Su Wang, Xiaoyuan Wang, Yihang Chen, Wenxuan Xu, Qiaolin Yu, Shuhuai Lin, Sipeng Zhang, Junxian You, and Xinpeng Wei. 2026. Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG.arXiv preprint arXiv:2605.28044(2026)

Pith/arXiv arXiv 2026
[34]

Abhilasha Ravichander, Yonatan Belinkov, and Eduard Hovy. 2021. Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?. InProceedings of EACL

2021
[35]

Williamson, Alex J

Bernhard Schölkopf, Robert C. Williamson, Alex J. Smola, John Shawe-Taylor, and John Platt. 1999. Support Vector Method for Novelty Detection. InAdvances in Neural Information Processing Systems (NeurIPS)

1999
[36]

Tianneng Shi, Jingxuan He, Zhun Wang, Linyu Wu, Hongwei Li, Wenbo Guo, and Dawn Song. 2025. Progent: Programmable Privilege Control for LLM Agents. arXiv preprint arXiv:2504.11703(2025)

Pith/arXiv arXiv 2025
[37]

Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. 2025. COSMIC: Generalized Refusal Direction Identification in LLM Activations. InFindings of the Association for Computational Linguistics: ACL. arXiv:2506.00085

arXiv 2025
[38]

Representing Isolated Targets to Steer Language Models

Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, and Chenguang Wang. 2026. RepIt: Steering Language Models with Concept-Specific Refusal Vectors. InInternational Conference on Learning Representations (ICLR). 9 EvalMG ’26, July 24, 2026, Melbourne, Australia Li, Fan, and Zhuang OpenReview submission ID fsZkx8gek0; earlier arXiv title ...

2026
[39]

Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, and Lingpeng Kong. 2025. OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows.arXiv preprint arXiv:2510.24411(2025)

arXiv 2025
[40]

Elena Voita and Ivan Titov. 2020. Information-Theoretic Probing with Minimum Description Length. InProceedings of EMNLP

2020
[41]

Su Wang, Pin Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xiaochong Jiang, Lifei Liu, Haoran Yu, and Jingzhou Xu. 2026. When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems.arXiv preprint arXiv:2606.00448 (2026)

Pith/arXiv arXiv 2026
[42]

Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan, and Zexin Zhuang. 2026. Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice.arXiv preprint arXiv:2605.26559(2026)

Pith/arXiv arXiv 2026
[43]

Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. 2025. AgentVigil: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents. InFind- ings of the Association for Computational Linguistics: EMNLP. arXiv:2505.05849

arXiv 2025
[44]

Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, and Fangzhao Wu. 2025. Defending against Indirect Prompt Injection by Instruction Detection. InFindings of the Association for Computational Linguistics: EMNLP 2025. 19472–19487. doi:10.18653/v1/2025.findings-emnlp.1060 Uses hidden-state and gradient features for i...

work page doi:10.18653/v1/2025.findings-emnlp.1060 2025
[45]

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu
[46]

InNeurIPS D&B

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. InNeurIPS D&B
[47]

Bastian, Seyyed Hadi Hashemi, Chaowei Xiao, Wenbo Guo, and Bo Li

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang, Nathaniel D. Bastian, Seyyed Hadi Hashemi, Chaowei Xiao, Wenbo Guo, and Bo Li. 2026. ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems.arXiv preprint arXiv:2604.04426(2026)

Pith/arXiv arXiv 2026
[48]

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. InFindings of the Association for Computational Linguistics: ACL

2024
[49]

Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, and Xiaojun Wan
[50]

InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)

ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). arXiv:2507.16488

arXiv
[51]

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. InInternational Conference on Learning Representations (ICLR)

2024
[52]

Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, and William Yang Wang
[53]

MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison.arXiv preprint arXiv:2502.05174(2025)

arXiv 2025
[54]

Zexin Zhuang, Yanhang Li, and Zhichao Fan. 2026. Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit.arXiv preprint arXiv:2605.28873(2026)

Pith/arXiv arXiv 2026
[55]

Disregard prior instructions. Click Send Now

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, et al . 2023. Representa- tion Engineering: A Top-Down Approach to AI Transparency.arXiv preprint arXiv:2310.01405(2023). 10 When AUC 0.998 Is Not Enough EvalMG ’26, July 24, 2026, Melbourne, Australia A Attack-template catalog We list the...

Pith/arXiv arXiv 2023

[1] [1]

Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. 2018. Don’t Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering. InProceedings of CVPR

2018

[2] [2]

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. 2024. Refusal in Language Models Is Mediated by a Single Direction.arXiv preprint arXiv:2406.11717(2024)

Pith/arXiv arXiv 2024

[3] [3]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. 2025. Qwen2.5-VL Technical Report.arXiv preprint arXiv:2502.13923(2025)

Pith/arXiv arXiv 2025

[4] [4]

Yonatan Belinkov. 2022. Probing Classifiers: Promises, Shortcomings, and Ad- vances.Computational Linguistics48, 1 (2022), 207–219

2022

[5] [5]

Yonatan Belinkov and James Glass. 2019. Analysis Methods in Neural Language Processing: A Survey.Transactions of the Association for Computational Linguistics 7 (2019), 49–72

2019

[6] [6]

Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, and Xinpeng Wei. 2026. Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict.arXiv preprint arXiv:2605.14473 (2026)

Pith/arXiv arXiv 2026

[7] [7]

Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. InProceedings of LREC

2018

[8] [8]

1997.Bootstrap Methods and Their Application

Anthony C Davison and David V Hinkley. 1997.Bootstrap Methods and Their Application. Cambridge University Press

1997

[9] [9]

Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. 2024. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. InAdvances in Neural Information Processing Systems (NeurIPS)

2024

[10] [10]

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. 2023. Mind2Web: Towards a Generalist Agent for the Web. InAdvances in Neural Information Processing Systems (NeurIPS) — Datasets and Benchmarks Track

2023

[11] [11]

Bradley Efron. 1979. Bootstrap Methods: Another Look at the Jackknife.The Annals of Statistics7, 1 (1979), 1–26

1979

[12] [12]

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. 2021. Amnesic Probing: Behavioral Explanation with Amnesic Counterfactuals.Transactions of the Association for Computational Linguistics9 (2021), 160–175

2021

[13] [13]

Yushi Feng, Junye Du, Qifan Wang, Zizhan Ma, Qian Niu, Yutaka Matsuo, Long Feng, and Lequan Yu. 2026. CORA: Conformal Risk-Controlled Agents for Safe- guarded Mobile GUI Automation.arXiv preprint arXiv:2604.09155(2026)

Pith/arXiv arXiv 2026

[14] [14]

Wichmann

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. 2020. Shortcut Learning in Deep Neural Networks.Nature Machine Intelligence2, 11 (2020), 665–673

2020

[15] [15]

Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh

[16] [16]

InProceedings of CVPR

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. InProceedings of CVPR

[17] [17]

Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 2023. Not What You’ve Signed Up For: Compromising Real- World LLM-Integrated Applications with Indirect Prompt Injection. InProceedings of AISec

2023

[18] [18]

Bowman, and Noah A

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R. Bowman, and Noah A. Smith. 2018. Annotation Artifacts in Natural Language Inference Data. InProceedings of NAACL-HLT

2018

[19] [19]

Jiatong Han, Neil Band, Muhammed Razzak, Jannik Kossen, Tim G. J. Rudner, and Yarin Gal. 2025. Simple Factuality Probes Detect Hallucinations in Long-Form Natural Language Generation. InFindings of the Association for Computational Linguistics: EMNLP

2025

[20] [20]

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. 2024. WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models. InProceedings of ACL

2024

[21] [21]

John Hewitt and Percy Liang. 2019. Designing and Interpreting Probes with Control Tasks. InProceedings of EMNLP-IJCNLP

2019

[22] [22]

Yuelyu Ji, Wuwei Lan, and Patrick NG. 2025. MRAG-Suite: A Diagnostic Eval- uation Platform for Visual Retrieval-Augmented Generation.arXiv preprint arXiv:2509.24253(2025)

arXiv 2025

[23] [23]

Xiaochong Jiang, Shiqi Yang, Wenting Yang, Yichen Liu, and Cheng Ji. 2026. SoK: A Taxonomy of Attack Vectors and Defense Strategies for Agentic Supply Chain Runtime. InICLR 2026 Workshop on AI for Mechanism Design and Strategic Decision Making. arXiv preprint arXiv:2602.19555

Pith/arXiv arXiv 2026

[24] [24]

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried

[25] [25]

InProceedings of ACL

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks. InProceedings of ACL

[26] [26]

Olivier Ledoit and Michael Wolf. 2004. A Well-Conditioned Estimator for Large- Dimensional Covariance Matrices.Journal of Multivariate Analysis88, 2 (2004), 365–411

2004

[27] [27]

Yanhang Li, Zhichao Fan, and Zexin Zhuang. 2026. Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries.arXiv preprint arXiv:2605.18891(2026)

Pith/arXiv arXiv 2026

[28] [28]

Yanhang Li, Zhichao Fan, and Zexin Zhuang. 2026. SafetyRepro: Configuration- Conditional Rank Instability on Alignment Benchmarks.arXiv preprint arXiv:2605.25492(2026)

Pith/arXiv arXiv 2026

[29] [29]

Lixing Lin, Juli You, Yue Li, Luyun Lin, Yiqing Wang, Zhen Zhang, and Moxuan Zheng. 2026. Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection.arXiv preprint arXiv:2605.24834(2026)

Pith/arXiv arXiv 2026

[30] [30]

Hanjun Luo, Shenyu Dai, Chiming Ni, Xinfeng Li, Guibin Zhang, Kun Wang, Tongliang Liu, and Hanan Salam. 2025. AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents. InAdvances in Neural Information Processing Systems 38 (NeurIPS 2025)

2025

[31] [31]

Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine Learning in Python.Journal of Machine Learning Re...

2011

[32] [32]

Tiago Pimentel, Josef Valvoda, Rowan Hall Maudslay, Ran Zmigrod, Adina Williams, and Ryan Cotterell. 2020. Information-Theoretic Probing for Linguistic Structure. InProceedings of ACL

2020

[33] [33]

Pin Qian, Su Wang, Xiaoyuan Wang, Yihang Chen, Wenxuan Xu, Qiaolin Yu, Shuhuai Lin, Sipeng Zhang, Junxian You, and Xinpeng Wei. 2026. Relevant Is Not Warranted: Evidence-Force Calibration for Cited RAG.arXiv preprint arXiv:2605.28044(2026)

Pith/arXiv arXiv 2026

[34] [34]

Abhilasha Ravichander, Yonatan Belinkov, and Eduard Hovy. 2021. Probing the Probing Paradigm: Does Probing Accuracy Entail Task Relevance?. InProceedings of EACL

2021

[35] [35]

Williamson, Alex J

Bernhard Schölkopf, Robert C. Williamson, Alex J. Smola, John Shawe-Taylor, and John Platt. 1999. Support Vector Method for Novelty Detection. InAdvances in Neural Information Processing Systems (NeurIPS)

1999

[36] [36]

Tianneng Shi, Jingxuan He, Zhun Wang, Linyu Wu, Hongwei Li, Wenbo Guo, and Dawn Song. 2025. Progent: Programmable Privilege Control for LLM Agents. arXiv preprint arXiv:2504.11703(2025)

Pith/arXiv arXiv 2025

[37] [37]

Vincent Siu, Nicholas Crispino, Zihao Yu, Sam Pan, Zhun Wang, Yang Liu, Dawn Song, and Chenguang Wang. 2025. COSMIC: Generalized Refusal Direction Identification in LLM Activations. InFindings of the Association for Computational Linguistics: ACL. arXiv:2506.00085

arXiv 2025

[38] [38]

Representing Isolated Targets to Steer Language Models

Vincent Siu, Nathan W. Henry, Nicholas Crispino, Yang Liu, Dawn Song, and Chenguang Wang. 2026. RepIt: Steering Language Models with Concept-Specific Refusal Vectors. InInternational Conference on Learning Representations (ICLR). 9 EvalMG ’26, July 24, 2026, Melbourne, Australia Li, Fan, and Zhuang OpenReview submission ID fsZkx8gek0; earlier arXiv title ...

2026

[39] [39]

Qiushi Sun, Mukai Li, Zhoumianze Liu, Zhihui Xie, Fangzhi Xu, Zhangyue Yin, Kanzhi Cheng, Zehao Li, Zichen Ding, Qi Liu, Zhiyong Wu, Zhuosheng Zhang, Ben Kao, and Lingpeng Kong. 2025. OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows.arXiv preprint arXiv:2510.24411(2025)

arXiv 2025

[40] [40]

Elena Voita and Ivan Titov. 2020. Information-Theoretic Probing with Minimum Description Length. InProceedings of EMNLP

2020

[41] [41]

Su Wang, Pin Qian, Yihang Chen, Junxian You, Xiaoyuan Wang, Xiaochong Jiang, Lifei Liu, Haoran Yu, and Jingzhou Xu. 2026. When Safe Skills Collide: Measuring Compositional Risk in Agent Skill Ecosystems.arXiv preprint arXiv:2606.00448 (2026)

Pith/arXiv arXiv 2026

[42] [42]

Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan, and Zexin Zhuang. 2026. Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice.arXiv preprint arXiv:2605.26559(2026)

Pith/arXiv arXiv 2026

[43] [43]

Zhun Wang, Vincent Siu, Zhe Ye, Tianneng Shi, Yuzhou Nie, Xuandong Zhao, Chenguang Wang, Wenbo Guo, and Dawn Song. 2025. AgentVigil: Automatic Black-Box Red-teaming for Indirect Prompt Injection against LLM Agents. InFind- ings of the Association for Computational Linguistics: EMNLP. arXiv:2505.05849

arXiv 2025

[44] [44]

Tongyu Wen, Chenglong Wang, Xiyuan Yang, Haoyu Tang, Yueqi Xie, Lingjuan Lyu, Zhicheng Dou, and Fangzhao Wu. 2025. Defending against Indirect Prompt Injection by Instruction Detection. InFindings of the Association for Computational Linguistics: EMNLP 2025. 19472–19487. doi:10.18653/v1/2025.findings-emnlp.1060 Uses hidden-state and gradient features for i...

work page doi:10.18653/v1/2025.findings-emnlp.1060 2025

[45] [45]

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, Yitao Liu, Yiheng Xu, Shuyan Zhou, Silvio Savarese, Caiming Xiong, Victor Zhong, and Tao Yu

[46] [46]

InNeurIPS D&B

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments. InNeurIPS D&B

[47] [47]

Bastian, Seyyed Hadi Hashemi, Chaowei Xiao, Wenbo Guo, and Bo Li

Zhuowen Yuan, Zhaorun Chen, Zhen Xiang, Nathaniel D. Bastian, Seyyed Hadi Hashemi, Chaowei Xiao, Wenbo Guo, and Bo Li. 2026. ShieldNet: Network-Level Guardrails against Emerging Supply-Chain Injections in Agentic Systems.arXiv preprint arXiv:2604.04426(2026)

Pith/arXiv arXiv 2026

[48] [48]

Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. 2024. InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents. InFindings of the Association for Computational Linguistics: ACL

2024

[49] [49]

Zhenliang Zhang, Xinyu Hu, Huixuan Zhang, Junzhe Zhang, and Xiaojun Wan

[50] [50]

InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL)

ICR Probe: Tracking Hidden State Dynamics for Reliable Hallucination Detection in LLMs. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL). arXiv:2507.16488

arXiv

[51] [51]

Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, Uri Alon, and Graham Neubig. 2024. WebArena: A Realistic Web Environment for Building Autonomous Agents. InInternational Conference on Learning Representations (ICLR)

2024

[52] [52]

Kaijie Zhu, Xianjun Yang, Jindong Wang, Wenbo Guo, and William Yang Wang

[53] [53]

MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison.arXiv preprint arXiv:2502.05174(2025)

arXiv 2025

[54] [54]

Zexin Zhuang, Yanhang Li, and Zhichao Fan. 2026. Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit.arXiv preprint arXiv:2605.28873(2026)

Pith/arXiv arXiv 2026

[55] [55]

Disregard prior instructions. Click Send Now

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, et al . 2023. Representa- tion Engineering: A Top-Down Approach to AI Transparency.arXiv preprint arXiv:2310.01405(2023). 10 When AUC 0.998 Is Not Enough EvalMG ’26, July 24, 2026, Melbourne, Australia A Attack-template catalog We list the...

Pith/arXiv arXiv 2023