arxiv: 2605.06305 · v1 · submitted 2026-05-07 · 💻 cs.AI · cs.IR

Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs

Thomas Cory, Axel Küpper

Pith reviewed 2026-05-08 09:55 UTC · model grok-4.3

keywords PII annotation · HTTP traffic · LLM pipeline · privacy auditing · synthetic data generation · taxonomy agnostic · personally identifiable information · data leakage detection

The pith

LLMs can annotate PII values in HTTP traffic when any taxonomy is supplied at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the shortage of labeled data for detecting personally identifiable information leaks in web and app traffic. Current detectors need large amounts of manual labels and stay locked to one fixed set of PII categories. The authors present a multi-stage pipeline that preprocesses HTTP messages, classifies which PII types appear based on a runtime taxonomy, extracts the matching values, and checks the output. To test the system without exposing real user data, they also built an LLM generator that creates synthetic HTTP traffic carrying known, taxonomy-derived PII labels. Evaluation across three taxonomies of varying domains and detail shows the pipeline detects types and extracts values with good accuracy.

Core claim

The authors introduce a multi-stage LLM-based pipeline for taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies, where the taxonomy is provided at runtime. The pipeline integrates deterministic pre-processing, label-level classification, targeted instance-level value annotation, and output validation. To support evaluation, they develop an LLM-based generator for synthetic HTTP traffic with manually validated PII annotations derived from taxonomies. Evaluation across three taxonomies demonstrates accurate detection of PII types and extraction of corresponding values.

What carries the argument

A multi-stage LLM pipeline that performs deterministic pre-processing, runtime taxonomy-guided label classification, instance-level value annotation, and output validation, backed by an LLM generator for synthetic annotated HTTP traffic.
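The four stages named above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, prompt wording, the `Taxonomy` and `LLM` types, and the stub model are all hypothetical, and the paper does not specify how each stage is realised.

```python
import json
from typing import Callable
from urllib.parse import parse_qsl

# Hypothetical types: a taxonomy maps label name -> description, and the
# "LLM" is any callable that takes a prompt and returns a JSON string.
Taxonomy = dict[str, str]
LLM = Callable[[str], str]


def preprocess(body: str) -> str:
    """Stage 1 (assumed): deterministic normalisation of the message body."""
    try:
        return json.dumps(json.loads(body), indent=1, sort_keys=True)
    except ValueError:
        if "=" in body:  # treat as a form-encoded body and URL-decode it
            pairs = parse_qsl(body, keep_blank_values=True)
            return "\n".join(f"{k}={v}" for k, v in pairs)
        return body


def classify_labels(text: str, taxonomy: Taxonomy, llm: LLM) -> list[str]:
    """Stage 2 (assumed): ask which taxonomy labels occur in the body."""
    prompt = (f"Taxonomy: {json.dumps(taxonomy)}\nBody:\n{text}\n"
              "Return a JSON list of the labels that are present.")
    # Keep only labels that exist in the taxonomy supplied at runtime.
    return [label for label in json.loads(llm(prompt)) if label in taxonomy]


def annotate_values(text: str, labels: list[str], taxonomy: Taxonomy,
                    llm: LLM) -> list[dict]:
    """Stage 3 (assumed): one targeted extraction prompt per detected label."""
    annotations = []
    for label in labels:
        prompt = (f"Label: {label} ({taxonomy[label]})\nBody:\n{text}\n"
                  "Return a JSON list of the matching values.")
        for value in json.loads(llm(prompt)):
            annotations.append({"label": label, "value": value})
    return annotations


def validate(text: str, annotations: list[dict]) -> list[dict]:
    """Stage 4 (assumed): keep values that occur verbatim, drop duplicates."""
    seen, kept = set(), []
    for ann in annotations:
        key = (ann["label"], ann["value"])
        if isinstance(ann["value"], str) and ann["value"] in text and key not in seen:
            seen.add(key)
            kept.append(ann)
    return kept


def annotate(body: str, taxonomy: Taxonomy, llm: LLM) -> list[dict]:
    text = preprocess(body)
    labels = classify_labels(text, taxonomy, llm)
    return validate(text, annotate_values(text, labels, taxonomy, llm))


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model; returns canned JSON answers."""
    if prompt.startswith("Taxonomy"):
        return '["first_name", "unknown_label"]'
    return '["John", "Jane"]'


taxonomy = {"first_name": "Given name of a person", "last_name": "Family name"}
result = annotate("firstname=John&lastname=Doe", taxonomy, fake_llm)
print(result)
```

Because the taxonomy is an ordinary argument, switching to a different label set means passing a different dictionary; nothing in the sketch is tied to one fixed set of PII categories.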

If this is right

Privacy audit systems can switch between different PII taxonomies without retraining or relabeling data.
The need for large manually labeled traffic datasets shrinks because the pipeline generates its own annotations.
Labeled synthetic traffic can be produced on demand for any new or evolving privacy taxonomy.
Annotation works across taxonomies that differ in domain coverage and level of detail.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the synthetic generator matches real traffic distributions, the same pipeline could bootstrap training sets for smaller specialized detectors.
The method could extend beyond HTTP to other network flows once similar synthetic generators are built for them.
Real-time privacy monitors might adopt the pipeline to keep pace with changing regulations without periodic full retraining.

Load-bearing premise

The LLM pipeline will stay accurate and low-hallucination when run on real HTTP traffic whose structure and content differ from the synthetic examples used for testing.
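One way to make "low-hallucination" measurable is to count extracted values that cannot be found in the message body at all. The sketch below is an illustrative check, not a metric taken from the paper; the function name `ungrounded_rate` and the choice to also search the URL-decoded body are assumptions.

```python
from urllib.parse import unquote_plus


def ungrounded_rate(body: str, values: list[str]) -> float:
    """Share of extracted values found in neither the raw nor the URL-decoded body."""
    if not values:
        return 0.0
    decoded = unquote_plus(body)
    missing = [v for v in values if v not in body and v not in decoded]
    return len(missing) / len(values)


body = "email=jane%40example.com&city=Berlin"
rate = ungrounded_rate(body, ["jane@example.com", "Berlin", "Jane Doe"])
print(rate)  # one of the three values ("Jane Doe") is not present in the body
```

Tracking this rate separately on synthetic and on real captures would show whether a shift in structure or encoding makes the model invent values more often.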

What would settle it

Applying the pipeline to a set of real captured HTTP traffic and measuring a large drop in PII type detection accuracy or value extraction correctness relative to the synthetic results.

Figures

Figures reproduced from arXiv: 2605.06305 by Axel Küpper, Thomas Cory.

**Figure 1.** Overview of the proposed multi-stage annotation pipeline. The pipeline normalises HTTP message bodies, performs label-level classification …

**Figure 2.** Overview of the LLM harness used within each pipeline …

**Figure 3.** Overview of the synthetic HTTP message generator. Starting from a label taxonomy, the pipeline generates scenario templates, instantiates …

Original abstract

Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are tightly coupled to fixed label taxonomies, limiting transferability across domains and evolving definitions of PII. This paper investigates whether Large Language Models (LLMs) can support taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies when the taxonomy is provided at runtime. We introduce a multi-stage LLM-based pipeline that combines deterministic pre-processing with label-level classification, targeted instance-level value annotation, and output validation. To enable controlled evaluation and exemplar-based prompting without relying on sensitive real-user captures, we further propose an LLM-based generator for synthetic HTTP traffic with manually validated, taxonomy-derived PII annotations. We evaluate the approach across three taxonomies spanning different PII domains and granularity levels. Results show that the pipeline accurately detects PII types and extracts corresponding values for concrete PII taxonomies. Overall, our findings position LLMs as a promising foundation for flexible, taxonomy-agnostic traffic annotation and for creating labelled data under evolving privacy taxonomies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper gives a practical LLM pipeline for runtime taxonomy-agnostic PII labeling in HTTP traffic plus a synthetic generator, but all results stay inside that synthetic data.

The paper's main takeaway is a multi-stage LLM pipeline that annotates PII in HTTP traffic in a taxonomy-agnostic manner at runtime, paired with an LLM generator for synthetic labeled traffic. This setup aims to ease the labeled data problem for privacy audits.

What stands out as new is the integration of deterministic pre-processing steps with LLM-based classification and value extraction, plus the synthetic traffic generator that produces examples with validated annotations. The paper shows how this can work across three different taxonomies. It does a solid job describing the pipeline components and explaining why synthetic data helps avoid privacy issues with real captures. The approach is practical for creating training data under changing PII definitions.

The main soft spot is the evaluation. All tests use the synthetic HTTP traffic from their generator, with no reported results on actual outbound traces from real apps. As the stress-test notes, real traffic has irregular structures, encodings, and noise that the synthetic version might not fully replicate. This makes it unclear whether the reported accuracy would hold in deployment.

The paper is aimed at people building privacy auditing tools or working on traffic analysis in security and privacy. A reader looking for ideas on using LLMs for domain-specific annotation tasks would find the pipeline details useful. It deserves a serious referee because the core idea is sound and the synthetic generator is a concrete contribution, though it needs more grounding in real data to be fully convincing. I would recommend sending it for peer review, with feedback focused on adding validation against real HTTP captures.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a multi-stage LLM pipeline for taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies, supported by a companion LLM-based generator for synthetic HTTP traffic with manually validated annotations. It evaluates the pipeline across three PII taxonomies of varying domains and granularity, claiming that the approach accurately detects PII types and extracts corresponding values.

Significance. If the accuracy and low-hallucination claims hold, the work could meaningfully address labeled-data scarcity for privacy audits by enabling flexible, runtime-provided taxonomy annotation rather than fixed classifiers. The synthetic generator is a concrete strength, as it supports controlled, reproducible exemplar-based prompting without exposing real user data.

major comments (2)

[§4] §4 (Evaluation setup): The reported results are obtained exclusively on synthetic HTTP traffic produced by the paper's own LLM generator. No experiments on real outbound HTTP traces are described, yet the abstract and conclusion assert practical utility for privacy audits of web and mobile applications. Differences in JSON nesting, encoding (base64/URL), mixed content types, and contextual noise between synthetic and real distributions could affect transfer; this directly bears on whether the taxonomy-agnostic positioning is supported.
[Abstract and §4] Abstract and §4: The claim that the pipeline 'accurately detects PII types and extracts corresponding values' is stated without any quantitative metrics (precision, recall, F1, error rates, number of test instances per taxonomy, or baseline comparisons). This absence prevents assessment of effect size and undermines the strength of the central empirical claim.
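For reference, the metrics requested in the second comment can be computed from sets of gold and predicted annotations. This is a minimal sketch assuming exact matching on hypothetical `(label, value)` pairs; the example data are invented for illustration and are not results from the paper.

```python
def prf1(gold: set, predicted: set) -> tuple[float, float, float]:
    """Precision, recall and F1 for exact-match set comparison."""
    tp = len(gold & predicted)  # annotations that are both predicted and correct
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example: two of three predictions match the gold annotations.
gold = {("email", "jane@example.com"), ("first_name", "Jane"), ("city", "Berlin")}
pred = {("email", "jane@example.com"), ("first_name", "Jane"), ("last_name", "Berlin")}

p, r, f = prf1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

Comparing only the label part of each pair gives the label-level variant of the same metrics, which separates type-detection errors from value-extraction errors.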

minor comments (1)

[§3] The pipeline diagram or pseudocode in §3 would improve clarity of the deterministic pre-processing, label-level classification, instance-level annotation, and validation stages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate.

Referee: [§4] §4 (Evaluation setup): The reported results are obtained exclusively on synthetic HTTP traffic produced by the paper's own LLM generator. No experiments on real outbound HTTP traces are described, yet the abstract and conclusion assert practical utility for privacy audits of web and mobile applications. Differences in JSON nesting, encoding (base64/URL), mixed content types, and contextual noise between synthetic and real distributions could affect transfer; this directly bears on whether the taxonomy-agnostic positioning is supported.

Authors: We agree that evaluation on real outbound traces would provide stronger evidence of transferability. The synthetic generator was chosen specifically to support controlled, reproducible experiments and exemplar-based prompting while avoiding exposure of real user data. It incorporates controlled variations in nesting, encodings, and mixed content to approximate real distributions. In the revised manuscript we will add an expanded discussion of generator fidelity and a dedicated Limitations section that explicitly addresses potential domain shifts, contextual noise, and the need for future validation on consented or anonymized real traces. revision: partial
Referee: [Abstract and §4] Abstract and §4: The claim that the pipeline 'accurately detects PII types and extracts corresponding values' is stated without any quantitative metrics (precision, recall, F1, error rates, number of test instances per taxonomy, or baseline comparisons). This absence prevents assessment of effect size and undermines the strength of the central empirical claim.

Authors: We acknowledge that the abstract and the high-level summary in §4 do not present the quantitative metrics, which weakens the empirical claims as noted. The evaluation section does contain per-taxonomy results, but these are not sufficiently highlighted or summarized with standard metrics and baselines. We will revise the abstract to include key quantitative findings (e.g., average and per-taxonomy F1 scores) and expand §4 with explicit tables reporting precision, recall, F1, number of test instances, error rates, and baseline comparisons. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical pipeline with independent evaluation setup

The paper presents an empirical multi-stage LLM pipeline for taxonomy-agnostic PII annotation in HTTP traffic and a separate LLM-based generator for creating synthetic labeled examples with manual validation. Evaluation results are reported on this synthetic data across three taxonomies, but no mathematical derivations, equations, fitted parameters, or self-referential reductions exist. The generator is proposed explicitly to avoid real-user data sensitivity and is not claimed to be derived from the annotation pipeline itself. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner. The work is self-contained as a practical pipeline description rather than a closed theoretical derivation, consistent with a non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumption that LLMs can perform reliable instance-level value extraction from HTTP bodies when given only a taxonomy description; no free parameters, axioms, or invented entities are explicitly introduced in the abstract.

pith-pipeline@v0.9.0 · 5508 in / 1085 out tokens · 38971 ms · 2026-05-08T09:55:04.771921+00:00 · methodology
