arxiv: 2605.03129 · v1 · submitted 2026-05-04 · 💻 cs.CR · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization

Mingshuo Liu , Yiwei Zha , Min Chen

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL

keywords PII protectionLLM securitywebpage defenseprompt injectionpersonally identifiable informationadversarial sanitizationindirect prompt injectionweb scraping mitigation

0 comments

The pith

Webpage owners can insert hidden HTML fragments to prevent LLM assistants from harvesting contact-style personally identifiable information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ordinary webpage owners can protect contact-style personally identifiable information from being scraped by browsing-enabled large language models. It does this by embedding specially designed hidden HTML fragments that use indirect prompt injection to steer models away from verbatim or reconstructible disclosure. The fragments are optimized through rule-based leakage scoring, evolutionary mutation, and judge-based assessment of recoverability. Direct evaluation on three models shows defense success rates of at least 97 percent, often 100 percent, while normal same-page question answering remains functional. The method also shows partial robustness when pages are accessed via public URLs and subjected to attacker-side sanitization, though results vary by interface and prompt.

Core claim

By searching over fragment text and insertion position, PIIGuard produces webpage-level defensive fragments that achieve at least 97 percent defense success rate against contact PII leakage on tested models under rule-based and judge-based evaluation, often reaching 100 percent, while preserving benign utility; the fragments remain effective for some model-position pairs under public-URL browsing and attacker-side LLM sanitization.

What carries the argument

Optimized hidden HTML fragments inserted via indirect prompt injection to steer LLMs away from PII disclosure

If this is right

Page owners gain a deployable, independent tool to reduce PII leakage without depending on model or service providers.
Normal question-answering utility on the protected page stays intact for legitimate users.
The same fragments can block both direct and some sanitized or browsed access paths for certain model combinations.
Rule-based scoring combined with final judge evaluation reliably identifies effective fragments during optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption could lower the overall volume of contact PII that LLM assistants surface from public web pages.
Attackers may develop stronger sanitizers specifically tuned to detect and strip defensive fragments.
The fragment technique could be adapted and tested for protecting other categories of sensitive information on webpages.
Further checks could measure whether the hidden fragments affect page rendering or accessibility for ordinary human visitors.

Load-bearing premise

The defensive fragments continue to influence the model's output even after the webpage content is fetched through public browsing interfaces and possibly cleaned by attacker-side sanitizers.

What would settle it

A model successfully extracting and outputting the contact PII after the page has been accessed via public URL and sanitized by an attacker LLM would falsify the claim of practical mitigation.

Figures

Figures reproduced from arXiv: 2605.03129 by Min Chen, Mingshuo Liu, Yiwei Zha.

**Figure 1.** Figure 1: The pipeline demonstrates how attacker utilize modern LLM systems view at source ↗

**Figure 2.** Figure 2: Overview of PIIGuard. Phase 1: leakage assessment for initial seed frag view at source ↗

read the original abstract

Browsing-enabled LLM assistants can fetch webpages and answer contact-seeking queries, creating a practical channel for scraping contact-style personally identifiable information (PII) from public pages. Many prior defenses are deployed at the model, service, or agent layer rather than at the webpage itself, leaving ordinary page owners with limited deployable options. We present PIIGuard, a webpage-level defense that repurposes indirect prompt injection as a protective mechanism: the page owner embeds optimized hidden HTML fragments that steer the model away from verbatim or reconstructible disclosure of contact PII. PIIGuard searches over fragment text and insertion position using rule-based leakage scoring, evolutionary mutation, and final judge-based recoverability assessment. In direct-HTML evaluation on three target models (GPT-5.4-nano, Claude-haiku-4.5, and DeepSeek-chat(latest v3.2)), PIIGuard achieves at least 97.0% defense success rate under both rule-based and judge-based leakage evaluation, often reaching 100.0%, while preserving benign same-page QA utility. We further evaluate two harder settings: public-URL browsing and attacker-side LLM sanitization of fetched webpage. These results show that page-side defensive fragments can remain effective in deployment for some model-position pairs, but robustness varies substantially across browsing interfaces and sanitizer prompts. Overall, PIIGuard demonstrates that page owners can use page-side fragments as a practical mitigation for web-grounded PII leakage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PIIGuard shows page-side hidden fragments can block LLM PII leakage at high rates in direct HTML tests, but transfer to real URL browsing and sanitization is uneven and unquantified.

read the letter

The core result is that optimized hidden HTML fragments can steer three tested models away from leaking contact PII in direct-HTML settings, hitting 97-100% defense success while leaving normal page QA intact. The approach repurposes indirect prompt injection as a page-owner tool and uses rule-based leakage scoring plus evolutionary mutation to search for fragments and insertion points, followed by judge-based checks. That combination is new relative to the model- or service-layer defenses in the cited literature, and the direct evaluation gives a clean empirical demonstration that such fragments are feasible to generate and effective under controlled conditions. The paper also checks that benign utility is preserved, which is a practical plus for any deployable defense. The softer spots sit in the harder settings. Public-URL browsing and attacker-side sanitization only succeed for some model-position pairs, with the abstract stating that robustness varies substantially across interfaces and prompts. Because the optimization pipeline runs on clean direct HTML, it is not obvious how often the resulting fragments survive real fetch, rendering, or rewriting steps. No frequency counts or representative sampling of those cases are given, so the claim that this is a practical mitigation for page owners rests on the unmeasured assumption that the successful pairs are common enough. The work is aimed at researchers and practitioners in LLM agent security and web privacy who need page-level options. It engages honestly with the gap left by upstream defenses and reports both the strong controlled results and the variability in deployment conditions. The thinking is straightforward and the experiments are reproducible in principle, so it deserves a serious referee to press on the transfer evidence and ask for more detail on failure modes and coverage.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PIIGuard, a webpage-level defense that embeds optimized hidden HTML fragments to repurpose indirect prompt injection against contact-style PII harvesting by browsing-enabled LLMs. It reports defense success rates of at least 97% (often 100%) under rule-based and judge-based evaluation in direct-HTML injection on three models (GPT-5.4-nano, Claude-haiku-4.5, DeepSeek-chat), while preserving benign QA utility. In the harder public-URL browsing and attacker-side LLM sanitization settings, fragments remain effective for some model-position pairs but with substantial variation across interfaces and prompts. The optimization uses rule-based leakage scoring, evolutionary mutation, and judge-based recoverability assessment.

Significance. If the robustness in realistic settings can be quantified and strengthened, this provides a notable contribution by shifting PII mitigation to the page-owner layer, where prior work has focused on model/service/agent defenses. The empirical evaluation on external models, dual leakage metrics, and use of evolutionary search for defensive fragments are strengths that support falsifiable claims about page-side adversarial defenses.

major comments (2)

[Abstract] Abstract: The central practical-mitigation claim rests on the harder settings (public-URL browsing and attacker-side sanitization), yet these are reported only qualitatively as 'can remain effective for some model-position pairs' with 'robustness varies substantially,' without quantitative success rates, per-model breakdowns, or measured frequency of effective cases. This is load-bearing because the headline ≥97% rates apply only to direct-HTML injection.
[Evaluation] Evaluation sections: Exact sample sizes, precise definitions of the rule-based leakage scoring function, and failure-mode analysis (e.g., which PII types or query phrasings cause leakage) are not provided for the reported 97.0%+ rates. This prevents full verification of the empirical claims and assessment of whether the optimization pipeline generalizes beyond the direct-HTML training distribution.

minor comments (2)

[Abstract] Abstract: The distinction between direct-HTML results and the variable harder-setting results could be made more explicit in the opening summary sentence to avoid overstatement of deployability.
The evolutionary mutation and judge-based recoverability steps are described at a high level; adding pseudocode or parameter settings would improve reproducibility without altering the core claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We agree that clarifying the quantitative aspects of the harder evaluation settings and providing more methodological details will strengthen the paper. Below, we respond to each major comment and indicate the revisions we will make.

read point-by-point responses

Referee: [Abstract] Abstract: The central practical-mitigation claim rests on the harder settings (public-URL browsing and attacker-side sanitization), yet these are reported only qualitatively as 'can remain effective for some model-position pairs' with 'robustness varies substantially,' without quantitative success rates, per-model breakdowns, or measured frequency of effective cases. This is load-bearing because the headline ≥97% rates apply only to direct-HTML injection.

Authors: We acknowledge that the abstract highlights the strong results from direct-HTML injection while describing the public-URL browsing and attacker-side sanitization settings in more qualitative terms. This choice was made because the effectiveness in these settings shows substantial variation across different models, positions, interfaces, and sanitizer prompts, which makes a single headline number potentially misleading. However, we agree that providing quantitative metrics would better support the practical claims. In the revised manuscript, we will include specific success rates, per-model breakdowns, and the proportion of effective cases for the harder settings, drawing from the data we collected during evaluation. revision: yes
Referee: [Evaluation] Evaluation sections: Exact sample sizes, precise definitions of the rule-based leakage scoring function, and failure-mode analysis (e.g., which PII types or query phrasings cause leakage) are not provided for the reported 97.0%+ rates. This prevents full verification of the empirical claims and assessment of whether the optimization pipeline generalizes beyond the direct-HTML training distribution.

Authors: We agree that these details are necessary for full reproducibility and verification. We will add the exact sample sizes used in our experiments, the precise definition of the rule-based leakage scoring function, and a detailed failure-mode analysis breaking down cases by PII type and query phrasing to the evaluation section in the revision. This will allow readers to verify the claims and assess generalization of the optimization pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical optimization and evaluation

full rationale

The paper describes an optimization procedure (rule-based leakage scoring, evolutionary mutation, judge-based assessment) to generate defensive HTML fragments and then reports measured defense success rates on three external target models under direct-HTML injection, plus qualitative observations on two harder deployment settings. No equations, uniqueness theorems, or self-citations are invoked to derive the central claims; the reported percentages are direct experimental outcomes rather than quantities forced by fitting or redefinition. The work is therefore self-contained against external benchmarks and contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the optimization process is described at a high level without disclosed constants or assumptions beyond standard LLM behavior.

pith-pipeline@v0.9.0 · 5566 in / 1142 out tokens · 46737 ms · 2026-05-08T17:46:32.533817+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation / Foundation.AlphaCoordinateFixation washburn_uniqueness_aczel; J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

U(θ) = 2 µ_0(θ) + (1 − ℓ̄_R(θ)) + 0.25 µ_{0.25}(θ) ... The coefficients (2, 1, 0.25) encode a priority ordering
Foundation (whole forcing chain has zero adjustable parameters) reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Default optimizer hyperparameters: T=10, ϵ=0.15, D=3, |Z_0|=20, |S_score|=80, |E_eval|=100

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 12 canonical work pages · 2 internal anchors

[1]

Anthropic: Claude haiku 4.5.https://www.anthropic.com/claude/haiku(2025), accessed: 2026-03-15

2025
[2]

Bæk, D.H.: Does chatgpt and ai crawlers read javascript?https://seo.ai/blog/ does-chatgpt-and-ai-crawlers-read-javascript(2023), accessed: 2025-06-07

2023
[3]

CoRR abs/2502.20383(2025) PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization 17

Chiang, J.Y.F., Lee, S., Huang, J., Huang, F., Chen, Y.: Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis. CoRR abs/2502.20383(2025) PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization 17

work page arXiv 2025
[4]

DeepSeek-AI: Deepseek-v3.2: Pushing the frontier of open large language mod- els (12 2025),https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/ main/assets/paper.pdf, accessed: 2026-04-21

2025
[5]

In: 34th USENIX Security Symposium (USENIX Security 25)

Kim, H., Song, M., Na, S.H., Shin, S., Lee, K.: When{LLMs}go online: The emerging threat of{Web-Enabled}{LLMs}. In: 34th USENIX Security Symposium (USENIX Security 25). pp. 1729–1748 (2025)

2025
[6]

Lee, J., Park, G.: AutoGuard: AI Kill Switch for Malicious Web-based LLM Agents (2026),https://arxiv.org/abs/2511.13725

work page arXiv 2026
[7]

In: IEEE S&P (2026),https://github.com/ LetterLiGo/Agent-webcloak

Li, X., et al.: WebCloak: Characterizing and mitigating threats from LLM-driven web agents as intelligent scrapers. In: IEEE S&P (2026),https://github.com/ LetterLiGo/Agent-webcloak

2026
[8]

arXiv preprint arXiv:2409.11295 , year=

Liao, Z., et al.: EIA: Environmental injection attack on generalist web agents for privacy leakage. In: ICLR (2025),https://arxiv.org/abs/2409.11295

work page arXiv 2025
[9]

arXiv preprint arXiv:2502.04951 (2025)

Luo, Z., Peng, Z., Liu, Y., Sun, Z., Li, M., Zheng, J., He, X.: Unsafe LLM-Based Search: Quantitative Analysis and Mitigation of Safety Risks in AI Web Search. arXiv preprint arXiv:2502.04951 (2025)

work page arXiv 2025
[10]

WebGPT: Browser-assisted question-answering with human feedback

Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., Schulman, J.: WebGPT: Browser-assisted Question-answering with Human Feedback. CoRRabs/2112.09332(2021)

work page internal anchor Pith review arXiv 2021
[11]

com/api/docs/models/gpt-5.4-mini

OpenAI: Gpt-5.4 mini model | openai api (2026),https://developers.openai. com/api/docs/models/gpt-5.4-mini

2026
[12]

OpenAI: Gpt-5.4 nano model | openai api.https://developers.openai.com/api/ docs/models/gpt-5.4-nano(2026), accessed: 2026-03-15

2026
[13]

In: Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017

Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., Sule- man, K.: Newsqa: A machine comprehension dataset. In: Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017. pp. 191–200. Association for Computational Linguistics (2017)

2017
[14]

com/docs/functions, accessed: 2026-04-21

Vercel Inc.: Vercel documentation: Serverless functions (2026),https://vercel. com/docs/functions, accessed: 2026-04-21

2026
[15]

ACM Comput

Wang, M., Zhang, Y., Yu, B., Hao, B., Peng, C., Chen, Y., Zhou, W., Gu, J., Zhuang, C., Guo, R., Wang, W., Zhao, X.: Function calling in large language models: Industrial practices, challenges, and future directions. ACM Comput. Surv.58(9) (Feb 2026).https://doi.org/10.1145/3788284,https://doi.org/ 10.1145/3788284

work page doi:10.1145/3788284 2026
[16]

BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

Wei, J., Sun, Z., Papay, S., McKinney, S., Han, J., Fulford, I., Chung, H.W., Passos, A.T., Fedus, W., Glaese, A.: Browsecomp: A simple yet challenging benchmark for browsing agents. CoRRabs/2504.12516(2025)

work page internal anchor Pith review arXiv 2025
[17]

Xu, W., Zhu, G., Zhao, X., Pan, L., Li, L., Wang, W.: Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement. In: ACL. pp. 15474–15492 (2024).https: //doi.org/10.18653/V1/2024.ACL-LONG.826

work page doi:10.18653/v1/2024.acl-long.826 2024
[18]

Zhang, H

Yi, J., Xie, Y., Zhu, B., Kiciman, E., Sun, G., Xie, X., Wu, F.: Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models. In:Proceedingsofthe31stACMSIGKDDConferenceonKnowledgeDiscoveryand Data Mining. pp. 1809–1820. ACM (2025),https://doi.org/10.1145/3690624. 3709179

work page doi:10.1145/3690624 2025
[19]

Automated privacy information annotation in large language model interactions.arXiv preprint arXiv:2505.20910, 2025

Zeng, H., Liu, X., Hu, Y., Niu, C., Wu, F., Tang, S., Chen, G.: Automated Privacy Information Annotation in Large Language Model Interactions. CoRR abs/2505.20910(2025) 18 Liu, Zha et al

work page arXiv 2025
[20]

Zhong, P.Y., Chen, S., Wang, R., McCall, M., Titzer, B.L., Miller, H., Gibbons, P.B.: RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leak- age (Feb 2025).https://doi.org/10.48550/arXiv.2502.08966

work page doi:10.48550/arxiv.2502.08966 2025
[21]

Melon: Provable defense against indirect prompt injection attacks in ai agents.arXiv preprint arXiv:2502.05174, 2025

Zhu, K., Yang, X., Wang, J., Guo, W., Wang, W.Y.: MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison (Feb 2025). https://doi.org/10.48550/arXiv.2502.05174 A Detailed Experimental Setup This appendix collects implementation details that would interrupt the narrative flow of Section 5.1: hyperparameters for the optimizer an...

work page doi:10.48550/arxiv.2502.05174 2025