pith. machine review for the scientific record. sign in

arxiv: 2605.03129 · v1 · submitted 2026-05-04 · 💻 cs.CR · cs.AI· cs.CL

Recognition: 2 theorem links

· Lean Theorem

PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:46 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CL
keywords PII protectionLLM securitywebpage defenseprompt injectionpersonally identifiable informationadversarial sanitizationindirect prompt injectionweb scraping mitigation
0
0 comments X

The pith

Webpage owners can insert hidden HTML fragments to prevent LLM assistants from harvesting contact-style personally identifiable information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that ordinary webpage owners can protect contact-style personally identifiable information from being scraped by browsing-enabled large language models. It does this by embedding specially designed hidden HTML fragments that use indirect prompt injection to steer models away from verbatim or reconstructible disclosure. The fragments are optimized through rule-based leakage scoring, evolutionary mutation, and judge-based assessment of recoverability. Direct evaluation on three models shows defense success rates of at least 97 percent, often 100 percent, while normal same-page question answering remains functional. The method also shows partial robustness when pages are accessed via public URLs and subjected to attacker-side sanitization, though results vary by interface and prompt.

Core claim

By searching over fragment text and insertion position, PIIGuard produces webpage-level defensive fragments that achieve at least 97 percent defense success rate against contact PII leakage on tested models under rule-based and judge-based evaluation, often reaching 100 percent, while preserving benign utility; the fragments remain effective for some model-position pairs under public-URL browsing and attacker-side LLM sanitization.

What carries the argument

Optimized hidden HTML fragments inserted via indirect prompt injection to steer LLMs away from PII disclosure

If this is right

  • Page owners gain a deployable, independent tool to reduce PII leakage without depending on model or service providers.
  • Normal question-answering utility on the protected page stays intact for legitimate users.
  • The same fragments can block both direct and some sanitized or browsed access paths for certain model combinations.
  • Rule-based scoring combined with final judge evaluation reliably identifies effective fragments during optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption could lower the overall volume of contact PII that LLM assistants surface from public web pages.
  • Attackers may develop stronger sanitizers specifically tuned to detect and strip defensive fragments.
  • The fragment technique could be adapted and tested for protecting other categories of sensitive information on webpages.
  • Further checks could measure whether the hidden fragments affect page rendering or accessibility for ordinary human visitors.

Load-bearing premise

The defensive fragments continue to influence the model's output even after the webpage content is fetched through public browsing interfaces and possibly cleaned by attacker-side sanitizers.

What would settle it

A model successfully extracting and outputting the contact PII after the page has been accessed via public URL and sanitized by an attacker LLM would falsify the claim of practical mitigation.

Figures

Figures reproduced from arXiv: 2605.03129 by Min Chen, Mingshuo Liu, Yiwei Zha.

Figure 1
Figure 1. Figure 1: The pipeline demonstrates how attacker utilize modern LLM systems view at source ↗
Figure 2
Figure 2. Figure 2: Overview of PIIGuard. Phase 1: leakage assessment for initial seed frag view at source ↗
read the original abstract

Browsing-enabled LLM assistants can fetch webpages and answer contact-seeking queries, creating a practical channel for scraping contact-style personally identifiable information (PII) from public pages. Many prior defenses are deployed at the model, service, or agent layer rather than at the webpage itself, leaving ordinary page owners with limited deployable options. We present PIIGuard, a webpage-level defense that repurposes indirect prompt injection as a protective mechanism: the page owner embeds optimized hidden HTML fragments that steer the model away from verbatim or reconstructible disclosure of contact PII. PIIGuard searches over fragment text and insertion position using rule-based leakage scoring, evolutionary mutation, and final judge-based recoverability assessment. In direct-HTML evaluation on three target models (GPT-5.4-nano, Claude-haiku-4.5, and DeepSeek-chat(latest v3.2)), PIIGuard achieves at least 97.0% defense success rate under both rule-based and judge-based leakage evaluation, often reaching 100.0%, while preserving benign same-page QA utility. We further evaluate two harder settings: public-URL browsing and attacker-side LLM sanitization of fetched webpage. These results show that page-side defensive fragments can remain effective in deployment for some model-position pairs, but robustness varies substantially across browsing interfaces and sanitizer prompts. Overall, PIIGuard demonstrates that page owners can use page-side fragments as a practical mitigation for web-grounded PII leakage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces PIIGuard, a webpage-level defense that embeds optimized hidden HTML fragments to repurpose indirect prompt injection against contact-style PII harvesting by browsing-enabled LLMs. It reports defense success rates of at least 97% (often 100%) under rule-based and judge-based evaluation in direct-HTML injection on three models (GPT-5.4-nano, Claude-haiku-4.5, DeepSeek-chat), while preserving benign QA utility. In the harder public-URL browsing and attacker-side LLM sanitization settings, fragments remain effective for some model-position pairs but with substantial variation across interfaces and prompts. The optimization uses rule-based leakage scoring, evolutionary mutation, and judge-based recoverability assessment.

Significance. If the robustness in realistic settings can be quantified and strengthened, this provides a notable contribution by shifting PII mitigation to the page-owner layer, where prior work has focused on model/service/agent defenses. The empirical evaluation on external models, dual leakage metrics, and use of evolutionary search for defensive fragments are strengths that support falsifiable claims about page-side adversarial defenses.

major comments (2)
  1. [Abstract] Abstract: The central practical-mitigation claim rests on the harder settings (public-URL browsing and attacker-side sanitization), yet these are reported only qualitatively as 'can remain effective for some model-position pairs' with 'robustness varies substantially,' without quantitative success rates, per-model breakdowns, or measured frequency of effective cases. This is load-bearing because the headline ≥97% rates apply only to direct-HTML injection.
  2. [Evaluation] Evaluation sections: Exact sample sizes, precise definitions of the rule-based leakage scoring function, and failure-mode analysis (e.g., which PII types or query phrasings cause leakage) are not provided for the reported 97.0%+ rates. This prevents full verification of the empirical claims and assessment of whether the optimization pipeline generalizes beyond the direct-HTML training distribution.
minor comments (2)
  1. [Abstract] Abstract: The distinction between direct-HTML results and the variable harder-setting results could be made more explicit in the opening summary sentence to avoid overstatement of deployability.
  2. The evolutionary mutation and judge-based recoverability steps are described at a high level; adding pseudocode or parameter settings would improve reproducibility without altering the core claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We agree that clarifying the quantitative aspects of the harder evaluation settings and providing more methodological details will strengthen the paper. Below, we respond to each major comment and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central practical-mitigation claim rests on the harder settings (public-URL browsing and attacker-side sanitization), yet these are reported only qualitatively as 'can remain effective for some model-position pairs' with 'robustness varies substantially,' without quantitative success rates, per-model breakdowns, or measured frequency of effective cases. This is load-bearing because the headline ≥97% rates apply only to direct-HTML injection.

    Authors: We acknowledge that the abstract highlights the strong results from direct-HTML injection while describing the public-URL browsing and attacker-side sanitization settings in more qualitative terms. This choice was made because the effectiveness in these settings shows substantial variation across different models, positions, interfaces, and sanitizer prompts, which makes a single headline number potentially misleading. However, we agree that providing quantitative metrics would better support the practical claims. In the revised manuscript, we will include specific success rates, per-model breakdowns, and the proportion of effective cases for the harder settings, drawing from the data we collected during evaluation. revision: yes

  2. Referee: [Evaluation] Evaluation sections: Exact sample sizes, precise definitions of the rule-based leakage scoring function, and failure-mode analysis (e.g., which PII types or query phrasings cause leakage) are not provided for the reported 97.0%+ rates. This prevents full verification of the empirical claims and assessment of whether the optimization pipeline generalizes beyond the direct-HTML training distribution.

    Authors: We agree that these details are necessary for full reproducibility and verification. We will add the exact sample sizes used in our experiments, the precise definition of the rule-based leakage scoring function, and a detailed failure-mode analysis breaking down cases by PII type and query phrasing to the evaluation section in the revision. This will allow readers to verify the claims and assess generalization of the optimization pipeline. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical optimization and evaluation

full rationale

The paper describes an optimization procedure (rule-based leakage scoring, evolutionary mutation, judge-based assessment) to generate defensive HTML fragments and then reports measured defense success rates on three external target models under direct-HTML injection, plus qualitative observations on two harder deployment settings. No equations, uniqueness theorems, or self-citations are invoked to derive the central claims; the reported percentages are direct experimental outcomes rather than quantities forced by fitting or redefinition. The work is therefore self-contained against external benchmarks and contains no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the optimization process is described at a high level without disclosed constants or assumptions beyond standard LLM behavior.

pith-pipeline@v0.9.0 · 5566 in / 1142 out tokens · 46737 ms · 2026-05-08T17:46:32.533817+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 12 canonical work pages · 2 internal anchors

  1. [1]

    Anthropic: Claude haiku 4.5.https://www.anthropic.com/claude/haiku(2025), accessed: 2026-03-15

  2. [2]

    Bæk, D.H.: Does chatgpt and ai crawlers read javascript?https://seo.ai/blog/ does-chatgpt-and-ai-crawlers-read-javascript(2023), accessed: 2025-06-07

  3. [3]

    CoRR abs/2502.20383(2025) PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization 17

    Chiang, J.Y.F., Lee, S., Huang, J., Huang, F., Chen, Y.: Why Are Web AI Agents More Vulnerable Than Standalone LLMs? A Security Analysis. CoRR abs/2502.20383(2025) PIIGuard: Mitigating PII Harvesting under Adversarial Sanitization 17

  4. [4]

    DeepSeek-AI: Deepseek-v3.2: Pushing the frontier of open large language mod- els (12 2025),https://huggingface.co/deepseek-ai/DeepSeek-V3.2/resolve/ main/assets/paper.pdf, accessed: 2026-04-21

  5. [5]

    In: 34th USENIX Security Symposium (USENIX Security 25)

    Kim, H., Song, M., Na, S.H., Shin, S., Lee, K.: When{LLMs}go online: The emerging threat of{Web-Enabled}{LLMs}. In: 34th USENIX Security Symposium (USENIX Security 25). pp. 1729–1748 (2025)

  6. [6]

    Lee, J., Park, G.: AutoGuard: AI Kill Switch for Malicious Web-based LLM Agents (2026),https://arxiv.org/abs/2511.13725

  7. [7]

    In: IEEE S&P (2026),https://github.com/ LetterLiGo/Agent-webcloak

    Li, X., et al.: WebCloak: Characterizing and mitigating threats from LLM-driven web agents as intelligent scrapers. In: IEEE S&P (2026),https://github.com/ LetterLiGo/Agent-webcloak

  8. [8]

    arXiv preprint arXiv:2409.11295 , year=

    Liao, Z., et al.: EIA: Environmental injection attack on generalist web agents for privacy leakage. In: ICLR (2025),https://arxiv.org/abs/2409.11295

  9. [9]

    arXiv preprint arXiv:2502.04951 (2025)

    Luo, Z., Peng, Z., Liu, Y., Sun, Z., Li, M., Zheng, J., He, X.: Unsafe LLM-Based Search: Quantitative Analysis and Mitigation of Safety Risks in AI Web Search. arXiv preprint arXiv:2502.04951 (2025)

  10. [10]

    WebGPT: Browser-assisted question-answering with human feedback

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., Schulman, J.: WebGPT: Browser-assisted Question-answering with Human Feedback. CoRRabs/2112.09332(2021)

  11. [11]

    com/api/docs/models/gpt-5.4-mini

    OpenAI: Gpt-5.4 mini model | openai api (2026),https://developers.openai. com/api/docs/models/gpt-5.4-mini

  12. [12]

    OpenAI: Gpt-5.4 nano model | openai api.https://developers.openai.com/api/ docs/models/gpt-5.4-nano(2026), accessed: 2026-03-15

  13. [13]

    In: Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017

    Trischler, A., Wang, T., Yuan, X., Harris, J., Sordoni, A., Bachman, P., Sule- man, K.: Newsqa: A machine comprehension dataset. In: Proceedings of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL 2017, Vancouver, Canada, August 3, 2017. pp. 191–200. Association for Computational Linguistics (2017)

  14. [14]

    com/docs/functions, accessed: 2026-04-21

    Vercel Inc.: Vercel documentation: Serverless functions (2026),https://vercel. com/docs/functions, accessed: 2026-04-21

  15. [15]

    ACM Comput

    Wang, M., Zhang, Y., Yu, B., Hao, B., Peng, C., Chen, Y., Zhou, W., Gu, J., Zhuang, C., Guo, R., Wang, W., Zhao, X.: Function calling in large language models: Industrial practices, challenges, and future directions. ACM Comput. Surv.58(9) (Feb 2026).https://doi.org/10.1145/3788284,https://doi.org/ 10.1145/3788284

  16. [16]

    BrowseComp: A Simple Yet Challenging Benchmark for Browsing Agents

    Wei, J., Sun, Z., Papay, S., McKinney, S., Han, J., Fulford, I., Chung, H.W., Passos, A.T., Fedus, W., Glaese, A.: Browsecomp: A simple yet challenging benchmark for browsing agents. CoRRabs/2504.12516(2025)

  17. [17]

    Xu, W., Zhu, G., Zhao, X., Pan, L., Li, L., Wang, W.: Pride and Prejudice: LLM Amplifies Self-Bias in Self-Refinement. In: ACL. pp. 15474–15492 (2024).https: //doi.org/10.18653/V1/2024.ACL-LONG.826

  18. [18]

    Zhang, H

    Yi, J., Xie, Y., Zhu, B., Kiciman, E., Sun, G., Xie, X., Wu, F.: Benchmarking and Defending against Indirect Prompt Injection Attacks on Large Language Models. In:Proceedingsofthe31stACMSIGKDDConferenceonKnowledgeDiscoveryand Data Mining. pp. 1809–1820. ACM (2025),https://doi.org/10.1145/3690624. 3709179

  19. [19]

    Automated privacy information annotation in large language model interactions.arXiv preprint arXiv:2505.20910, 2025

    Zeng, H., Liu, X., Hu, Y., Niu, C., Wu, F., Tang, S., Chen, G.: Automated Privacy Information Annotation in Large Language Model Interactions. CoRR abs/2505.20910(2025) 18 Liu, Zha et al

  20. [20]

    Zhong, P.Y., Chen, S., Wang, R., McCall, M., Titzer, B.L., Miller, H., Gibbons, P.B.: RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leak- age (Feb 2025).https://doi.org/10.48550/arXiv.2502.08966

  21. [21]

    Melon: Provable defense against indirect prompt injection attacks in ai agents.arXiv preprint arXiv:2502.05174, 2025

    Zhu, K., Yang, X., Wang, J., Guo, W., Wang, W.Y.: MELON: Indirect Prompt Injection Defense via Masked Re-execution and Tool Comparison (Feb 2025). https://doi.org/10.48550/arXiv.2502.05174 A Detailed Experimental Setup This appendix collects implementation details that would interrupt the narrative flow of Section 5.1: hyperparameters for the optimizer an...