Addressing Labelled Data Scarcity: Taxonomy-Agnostic Annotation of PII Values in HTTP Traffic using LLMs
Pith reviewed 2026-05-08 09:55 UTC · model grok-4.3
The pith
LLMs can annotate PII values in HTTP traffic when any taxonomy is supplied at runtime.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce a multi-stage LLM-based pipeline for taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies, where the taxonomy is provided at runtime. The pipeline integrates deterministic pre-processing, label-level classification, targeted instance-level value annotation, and output validation. To support evaluation, they develop an LLM-based generator for synthetic HTTP traffic with manually validated PII annotations derived from taxonomies. Evaluation across three taxonomies demonstrates accurate detection of PII types and extraction of corresponding values.
What carries the argument
A multi-stage LLM pipeline that performs deterministic pre-processing, runtime taxonomy-guided label classification, instance-level value annotation, and output validation, backed by an LLM generator for synthetic annotated HTTP traffic.
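The four stages named above can be sketched as a chain of plain functions. This is a minimal illustration, not the authors' implementation: the `LLM` callable, the prompt wording, the JSON reply format, and every function name are assumptions made for the example, and `fake_llm` is a stand-in so the sketch runs without a model.

```python
import json
from typing import Callable, Dict, List

# Hypothetical LLM interface (assumption): prompt text in, reply text out.
LLM = Callable[[str], str]


def preprocess(body: str) -> str:
    """Stage 1 (assumed): deterministic clean-up; here only outer whitespace is trimmed."""
    return body.strip()


def classify_labels(llm: LLM, body: str, taxonomy: List[str]) -> List[str]:
    """Stage 2: ask which taxonomy labels occur; discard labels outside the taxonomy."""
    prompt = (
        "Taxonomy labels: " + ", ".join(taxonomy)
        + "\nBody: " + body
        + "\nReturn a JSON list of the labels that are present."
    )
    reply = json.loads(llm(prompt))
    return [label for label in reply if label in taxonomy]


def annotate_values(llm: LLM, body: str, label: str) -> List[str]:
    """Stage 3: targeted value extraction for one detected label."""
    prompt = f"Label: {label}\nBody: {body}\nReturn a JSON list of values for this label."
    return json.loads(llm(prompt))


def validate(body: str, raw: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Stage 4: keep only string values that occur verbatim in the body."""
    checked = {
        label: [v for v in values if isinstance(v, str) and v in body]
        for label, values in raw.items()
    }
    return {label: values for label, values in checked.items() if values}


def run_pipeline(llm: LLM, body: str, taxonomy: List[str]) -> Dict[str, List[str]]:
    """Run the four stages; the taxonomy is supplied at call time."""
    clean = preprocess(body)
    labels = classify_labels(llm, clean, taxonomy)
    raw = {label: annotate_values(llm, clean, label) for label in labels}
    return validate(clean, raw)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    if prompt.startswith("Taxonomy labels"):
        return '["email", "unknown_label"]'
    return '["a@b.com", "hallucinated@x.org"]'


print(run_pipeline(fake_llm, '{"email": "a@b.com"}', ["email", "phone"]))
```

Because the taxonomy is an ordinary argument, switching taxonomies in this sketch means passing a different list rather than retraining anything. The validation stage drops both the out-of-taxonomy label and the value that does not appear in the body.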
If this is right
- Privacy audit systems can switch between different PII taxonomies without retraining or relabeling data.
- The need for large manually labeled traffic datasets shrinks because the pipeline generates its own annotations.
- Labeled synthetic traffic can be produced on demand for any new or evolving privacy taxonomy.
- Annotation works across taxonomies that differ in domain coverage and level of detail.
Where Pith is reading between the lines
- If the synthetic generator matches real traffic distributions, the same pipeline could bootstrap training sets for smaller specialized detectors.
- The method could extend beyond HTTP to other network flows once similar synthetic generators are built for them.
- Real-time privacy monitors might adopt the pipeline to keep pace with changing regulations without periodic full retraining.
Load-bearing premise
The LLM pipeline will stay accurate and low-hallucination when run on real HTTP traffic whose structure and content differ from the synthetic examples used for testing.
What would settle it
Applying the pipeline to real captured HTTP traffic: a large drop in PII type detection accuracy or value extraction correctness relative to the synthetic results would falsify the premise.
Original abstract
Automated privacy audits of web and mobile applications often analyse outbound HTTP traffic to detect Personally Identifiable Information (PII) leakage. However, existing learning-based detectors typically depend on scarce, manually labelled traffic and are tightly coupled to fixed label taxonomies, limiting transferability across domains and evolving definitions of PII. This paper investigates whether Large Language Models (LLMs) can support taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies when the taxonomy is provided at runtime. We introduce a multi-stage LLM-based pipeline that combines deterministic pre-processing with label-level classification, targeted instance-level value annotation, and output validation. To enable controlled evaluation and exemplar-based prompting without relying on sensitive real-user captures, we further propose an LLM-based generator for synthetic HTTP traffic with manually validated, taxonomy-derived PII annotations. We evaluate the approach across three taxonomies spanning different PII domains and granularity levels. Results show that the pipeline accurately detects PII types and extracts corresponding values for concrete PII taxonomies. Overall, our findings position LLMs as a promising foundation for flexible, taxonomy-agnostic traffic annotation and for creating labelled data under evolving privacy taxonomies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes a multi-stage LLM pipeline for taxonomy-agnostic annotation of explicitly transmitted PII values in HTTP message bodies, supported by a companion LLM-based generator for synthetic HTTP traffic with manually validated annotations. It evaluates the pipeline across three PII taxonomies of varying domains and granularity, claiming that the approach accurately detects PII types and extracts corresponding values.
Significance. If the accuracy and low-hallucination claims hold, the work could meaningfully address labeled-data scarcity for privacy audits by enabling flexible, runtime-provided taxonomy annotation rather than fixed classifiers. The synthetic generator is a concrete strength, as it supports controlled, reproducible exemplar-based prompting without exposing real user data.
major comments (2)
- [§4] Evaluation setup: The reported results are obtained exclusively on synthetic HTTP traffic produced by the paper's own LLM generator. No experiments on real outbound HTTP traces are described, yet the abstract and conclusion assert practical utility for privacy audits of web and mobile applications. Differences in JSON nesting, encoding (base64/URL), mixed content types, and contextual noise between synthetic and real distributions could affect transfer; this directly bears on whether the taxonomy-agnostic positioning is supported.
- [Abstract and §4] The claim that the pipeline 'accurately detects PII types and extracts corresponding values' is stated without any quantitative metrics (precision, recall, F1, error rates, number of test instances per taxonomy, or baseline comparisons). This absence prevents assessment of effect size and undermines the strength of the central empirical claim.
minor comments (1)
- [§3] A pipeline diagram or pseudocode in §3 would clarify the deterministic pre-processing, label-level classification, instance-level annotation, and validation stages.
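To make the structural concerns raised in the major comments concrete (JSON nesting, URL and base64 encoding, mixed content types), the sketch below shows one way a deterministic pre-processing step could normalise a body before any LLM call. The paper's actual pre-processing is not described on this page, so everything here is an assumption: the function names, the content-type dispatch, and the base64 heuristic (length at least 8, a multiple of 4, decodes to valid UTF-8), which can misfire on short plain strings.

```python
import base64
import json
from typing import Any, Iterator, List, Optional, Tuple
from urllib.parse import parse_qsl


def flatten(obj: Any, prefix: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (key_path, scalar) pairs from arbitrarily nested JSON."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, obj


def try_base64(value: str) -> Optional[str]:
    """Heuristic (assumption): return decoded text if value looks like base64 of UTF-8 text."""
    if len(value) < 8 or len(value) % 4 != 0:
        return None
    try:
        return base64.b64decode(value, validate=True).decode("utf-8")
    except ValueError:  # covers binascii.Error and UnicodeDecodeError
        return None


def preprocess_body(body: str, content_type: str) -> List[Tuple[str, Any]]:
    """Turn a raw body into flat (key, value) pairs with encodings undone."""
    if "json" in content_type:
        pairs = list(flatten(json.loads(body)))
    elif "x-www-form-urlencoded" in content_type:
        # parse_qsl also performs URL decoding of keys and values.
        pairs = parse_qsl(body, keep_blank_values=True)
    else:
        pairs = [("", body)]  # unknown type: pass the body through untouched
    result = []
    for key, value in pairs:
        if isinstance(value, str):
            decoded = try_base64(value)
            if decoded is not None:
                value = decoded
        result.append((key, value))
    return result
```

Keeping this step deterministic means the same body always yields the same pairs, so any variation in the final annotations can be attributed to the LLM stages.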
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will incorporate.
- Referee: [§4] Evaluation setup: The reported results are obtained exclusively on synthetic HTTP traffic produced by the paper's own LLM generator. No experiments on real outbound HTTP traces are described, yet the abstract and conclusion assert practical utility for privacy audits of web and mobile applications. Differences in JSON nesting, encoding (base64/URL), mixed content types, and contextual noise between synthetic and real distributions could affect transfer; this directly bears on whether the taxonomy-agnostic positioning is supported.
  Authors: We agree that evaluation on real outbound traces would provide stronger evidence of transferability. The synthetic generator was chosen specifically to support controlled, reproducible experiments and exemplar-based prompting while avoiding exposure of real user data. It incorporates controlled variations in nesting, encodings, and mixed content to approximate real distributions. In the revised manuscript we will add an expanded discussion of generator fidelity and a dedicated Limitations section that explicitly addresses potential domain shifts, contextual noise, and the need for future validation on consented or anonymized real traces.
  Revision: partial
- Referee: [Abstract and §4] The claim that the pipeline 'accurately detects PII types and extracts corresponding values' is stated without any quantitative metrics (precision, recall, F1, error rates, number of test instances per taxonomy, or baseline comparisons). This absence prevents assessment of effect size and undermines the strength of the central empirical claim.
  Authors: We acknowledge that the abstract and the high-level summary in §4 do not present the quantitative metrics, which weakens the empirical claims as noted. The evaluation section does contain per-taxonomy results, but these are not sufficiently highlighted or summarized with standard metrics and baselines. We will revise the abstract to include key quantitative findings (e.g., average and per-taxonomy F1 scores) and expand §4 with explicit tables reporting precision, recall, F1, number of test instances, error rates, and baseline comparisons.
  Revision: yes
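For readers who want to see what the requested metrics involve, here is a small sketch of instance-level precision, recall, and F1 over (label, value) pairs, in an exact and a fuzzy variant. The paper's matching criteria are not given on this page, so the fuzzy rule used here (same label and a `difflib` similarity ratio of at least 0.8, matched greedily) is purely an assumption, as are the function names.

```python
from difflib import SequenceMatcher
from typing import Set, Tuple

Annotation = Tuple[str, str]  # (label, value)


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts; 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def exact_scores(pred: Set[Annotation], gold: Set[Annotation]) -> Tuple[float, float, float]:
    """A prediction counts only if label and value match the gold pair exactly."""
    tp = len(pred & gold)
    return prf(tp, len(pred - gold), len(gold - pred))


def fuzzy_scores(
    pred: Set[Annotation], gold: Set[Annotation], threshold: float = 0.8
) -> Tuple[float, float, float]:
    """Assumed fuzzy rule: same label and string similarity >= threshold.

    Matching is greedy and one-to-one; sorting keeps the result deterministic.
    """
    unmatched = sorted(gold)
    tp = 0
    for label, value in sorted(pred):
        for candidate in unmatched:
            same_label = candidate[0] == label
            similar = SequenceMatcher(None, value, candidate[1]).ratio() >= threshold
            if same_label and similar:
                unmatched.remove(candidate)
                tp += 1
                break
    return prf(tp, len(pred) - tp, len(unmatched))


gold = {("email", "john@example.com"), ("name", "John")}
pred = {("email", "john@example.co"), ("name", "John"), ("phone", "123")}
print("exact:", exact_scores(pred, gold))
print("fuzzy:", fuzzy_scores(pred, gold))
```

In this toy case the truncated e-mail address is an exact-match error but a fuzzy-match hit, which is why reporting both variants side by side tells a reader whether errors come from wrong types or from slightly wrong span boundaries.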
Circularity Check
No significant circularity; empirical pipeline with independent evaluation setup
The paper presents an empirical multi-stage LLM pipeline for taxonomy-agnostic PII annotation in HTTP traffic and a separate LLM-based generator for creating synthetic labeled examples with manual validation. Evaluation results are reported on this synthetic data across three taxonomies, but no mathematical derivations, equations, fitted parameters, or self-referential reductions exist. The generator is proposed explicitly to avoid real-user data sensitivity and is not claimed to be derived from the annotation pipeline itself. No self-citations, uniqueness theorems, or ansatzes are invoked in a load-bearing manner. The work is self-contained as a practical pipeline description rather than a closed theoretical derivation, consistent with a non-circular empirical study.
Appendix A. Evaluation Results
Table 2. Per-step precision (P), recall (R), and F1 at label level and instance level (fuzzy and exact) across all three taxonomie...
- **Minimal span**: select the smallest substring that is the PII value (exclude keys, quotes, separators, and surrounding whitespace).
- **Multi-value lists**: if multiple PII values appear in a list (comma-separated, array, repeated fields), annotate **each value separately**.
- **Split across fields**: if a combined concept is split into multiple fields, annotate each field's value separately.
  - Example: `firstname=John&lastname=Doe`: annotate `John` and `Doe` separately (do NOT create a single 'John Doe' value).
- **Deduplicate repeated values (strong rule)**: deduplicate **globally by exact value string**.
  - If the **exact same value string** appears multiple times anywhere in the body (even under different keys), include it **only once** in `annotations`.
  - Do not attempt 'semantic deduplication' (e.g., do not merge `John` and `Doe`).
- **No inference**: annotate only what is explicitly present; do not guess missing pieces or reconstruct values.

---

## 5) Use context to resolve ambiguity

Many values are ambiguous in isolation (especially numbers and short strings). Use **surrounding context**, such as nearby keys, units, structural position, and typical formatting, to choose the correct type...
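Several of the annotation rules quoted in this excerpt (minimal span, global deduplication by exact value string, no inference) are mechanical enough to be enforced after the LLM replies. The sketch below shows one possible post-hoc check. It is an illustration under assumptions, not the paper's validator: the `annotations` name comes from the prompt excerpt, but the `{"label": ..., "value": ...}` item shape and the function name are invented for the example.

```python
from typing import Dict, List


def validate_annotations(body: str, annotations: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Apply three of the quoted rules to raw LLM output.

    Item shape {"label": ..., "value": ...} is an assumption for this sketch.
    """
    seen = set()
    kept = []
    for item in annotations:
        # Minimal span: drop surrounding whitespace and quote characters.
        value = item["value"].strip().strip("\"'")
        # No inference: the value must occur verbatim in the body.
        if not value or value not in body:
            continue
        # Strong rule: deduplicate globally by exact value string.
        if value in seen:
            continue
        seen.add(value)
        kept.append({"label": item["label"], "value": value})
    return kept


body = "firstname=John&lastname=Doe&nick=John"
raw = [
    {"label": "first_name", "value": " John "},
    {"label": "last_name", "value": "Doe"},
    {"label": "nickname", "value": "John"},       # repeated value string
    {"label": "full_name", "value": "John Doe"},  # not present verbatim
]
print(validate_annotations(body, raw))
```

Here the second `John` is dropped by the deduplication rule and the reconstructed `John Doe` is dropped by the no-inference rule, mirroring the `firstname=John&lastname=Doe` example in the prompt. The first label seen for a value wins, which is a design choice of this sketch rather than something the excerpt specifies.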