pith. machine review for the scientific record. sign in

arxiv: 2605.12264 · v1 · submitted 2026-05-12 · 💻 cs.CR · cs.CL· cs.LG

Recognition: no theorem link

Reconstruction of Personally Identifiable Information from Supervised Finetuned Models

Authors on Pith no claims yet

Pith reviewed 2026-05-13 04:06 UTC · model grok-4.3

classification 💻 cs.CR cs.CLcs.LG
keywords personally identifiable informationPII reconstructionsupervised finetuninglarge language modelsprivacy leakagedecoding algorithmprefix-based attacksmedical and legal datasets
0
0 comments X

The pith

Supervised finetuned language models leak personally identifiable information that a new decoding algorithm can reconstruct more effectively than existing methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether supervised finetuning on instruction-response pairs in sensitive domains causes large language models to expose user-provided personal details. It builds multi-turn question-answer datasets in medical and legal settings that embed various types of PII to create a testbed for realistic leakage measurement. A new decoding procedure called COVA is presented for recovering this information when an attacker knows only prefixes from the training examples. Experiments indicate that even limited attacker knowledge raises reconstruction rates, although success rates differ markedly by the category of personal data involved. This matters for any organization that adapts general models to private-sector tasks, since the work shows that standard finetuning can turn user data into extractable content.

Core claim

The paper claims that SFT models trained on user-centric datasets in medical and legal domains retain and expose PII, which an adversary can recover through prefix-based attacks. The authors introduce COVA as a decoding algorithm that consistently outperforms prior extraction techniques in this reconstruction setting. Their results further establish that partial attacker knowledge about the fine-tuning data measurably increases reconstruction success while leakage levels vary substantially across different PII types.

What carries the argument

COVA, a novel decoding algorithm that reconstructs PII under prefix-based attacks by using partial attacker knowledge to guide token generation from the finetuned model.

If this is right

  • Partial attacker knowledge of the fine-tuning data can significantly increase the success rate of PII reconstruction.
  • The extent of leakage varies substantially across different categories of personally identifiable information.
  • SFT models in sensitive domains remain vulnerable to extraction even after finetuning completes.
  • Existing extraction methods are outperformed by COVA when attackers possess prefix information.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Organizations using SFT on user data in regulated fields may need additional privacy controls beyond standard training practices.
  • The observed variation by PII type suggests that certain data fields could be prioritized for redaction or synthetic replacement during dataset construction.
  • Similar prefix-guided decoding techniques might be tested on other model adaptation methods such as reinforcement learning from human feedback.

Load-bearing premise

The multi-turn user-centric Q&A datasets constructed for medical and legal domains accurately represent realistic supervised finetuning scenarios and enable valid measurement of PII leakage.

What would settle it

Running COVA on independently collected real-world SFT datasets from medical or legal applications and finding that it does not outperform standard extraction methods would falsify the claimed reconstruction advantage.

Figures

Figures reproduced from arXiv: 2605.12264 by Alina Oprea, Sae Furukawa.

Figure 1
Figure 1. Figure 1: Examples of Attacker Goals Our focus is on measuring the leakage of associations between users and their sensitive data. Therefore, we do not consider un￾targeted attacks that aim to extract arbitrary or unassociated PII at scale. In addition, we focus exclusively on leakage arising from the fine-tuning dataset and do not consider potential information leakage from the pre-training data. Attacker’s Capabil… view at source ↗
Figure 2
Figure 2. Figure 2: Supervised Finetuning (SFT) Setup not have access to the model’s prior (i.e., the distribution induced by the model before fine-tuning). However, we further evaluate how reconstruction improves when the attacker leverages prior knowledge to increase confidence in the true PII. The central question we investigate is how varying levels of attacker knowledge affect the ability to infer associations between us… view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of COVA [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example of User Profile with Synthetic PII [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 7
Figure 7. Figure 7: Reconstruction rate (%) of the user’s name across [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Comparison of reconstruction performance (%) [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Comparison of reconstruction performance across [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Example of Data Annotation simplest first-name–last-name concatenations. Generated variants are augmented by multiple digits and characters [PITH_FULL_IMAGE:figures/full_fig_p015_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Duplicate Count Per PII E Impact of Duplication Prior work has shown that duplication of PII or training data is correlated with memorization in LLMs [10, 34]. Following this line of work, we examine how reconstruction success varies with dupli￾cation count. Specifically, we group instances by their duplication frequency in the training dataset and measure reconstruction suc￾cess in both PII association a… view at source ↗
Figure 12
Figure 12. Figure 12: Reconstruction rate (%) per duplication Our findings indicate that the relationship between duplication and reconstruction is not uniformly strong. We do not observe a consistent positive correlation across PII types. While DOB exhibits a modest increasing trend with duplication count, email and name 0 5 10 15 20 Redacted Synthetic In-distribution Summary Keywords Demographics Name +11.81 +10.80 +11.20 +1… view at source ↗
Figure 14
Figure 14. Figure 14: Reconstruction rate (%) of the user’s name across [PITH_FULL_IMAGE:figures/full_fig_p017_14.png] view at source ↗
read the original abstract

Supervised Finetuning (SFT) has become one of the primary methods for adapting a large language model (LLM) with extensive pre-trained knowledge to domain-specific, instruction-following tasks. SFT datasets, composed of instruction-response pairs, often include user-provided information that may contain sensitive data such as personally identifiable information (PII), raising privacy concerns. This paper studies the problem of PII reconstruction from SFT models for the first time. We construct multi-turn, user-centric Q&A datasets in sensitive domains, specifically medical and legal settings, that incorporate PII to enable realistic evaluation of leakage. Using these datasets, we evaluate the extent to which an adversary, with varying levels of knowledge about the fine-tuning dataset, can infer sensitive information about individuals whose data was used during SFT. In the reconstruction setting, we propose COVA, a novel decoding algorithm to reconstruct PII under prefix-based attacks, consistently outperforming existing extraction methods. Our results show that even partial attacker knowledge can significantly improve reconstruction success, while leakage varies substantially across PII types.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to be the first to study reconstruction of personally identifiable information (PII) from supervised fine-tuned (SFT) LLMs. It constructs multi-turn user-centric Q&A datasets in medical and legal domains that incorporate PII, evaluates adversaries with varying levels of knowledge about the fine-tuning data, and proposes COVA, a novel decoding algorithm for PII reconstruction under prefix-based attacks that is claimed to consistently outperform existing extraction methods. Results indicate that even partial attacker knowledge significantly improves reconstruction success and that leakage varies substantially across PII types.

Significance. If the central empirical claims hold, the work would establish concrete privacy risks in applying SFT to sensitive domains and introduce a practical auditing tool (COVA) for measuring PII leakage. It highlights how attacker knowledge and PII type affect extraction success in multi-turn settings, which could guide data curation and defense strategies. The empirical framing is a strength, but impact is limited by the lack of demonstrated alignment between the constructed datasets and real SFT distributions.

major comments (2)
  1. [Dataset Construction] The evaluation of leakage variation across PII types and COVA's outperformance rests entirely on the author-constructed multi-turn medical/legal Q&A datasets. The manuscript provides no quantitative comparison (e.g., PII frequency per turn, context naturalness, or distributional statistics) to real-world SFT corpora, so the reported improvements from partial attacker knowledge and the differential leakage claims may not generalize beyond the synthetic construction process.
  2. [Results and Evaluation] The abstract and results sections assert that COVA 'consistently outperforming existing extraction methods' and that 'partial attacker knowledge can significantly improve reconstruction success,' yet no specific success rates, baseline implementations, error bars, or statistical tests are referenced. Without these, the magnitude and reliability of the claimed advantages cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract refers to 'comparative results' without naming the baselines or metrics; adding one sentence with this information would improve clarity for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address each major comment below in detail and indicate the revisions made to strengthen the paper.

read point-by-point responses
  1. Referee: [Dataset Construction] The evaluation of leakage variation across PII types and COVA's outperformance rests entirely on the author-constructed multi-turn medical/legal Q&A datasets. The manuscript provides no quantitative comparison (e.g., PII frequency per turn, context naturalness, or distributional statistics) to real-world SFT corpora, so the reported improvements from partial attacker knowledge and the differential leakage claims may not generalize beyond the synthetic construction process.

    Authors: We agree that greater transparency on dataset characteristics would help readers evaluate generalizability. In the revised manuscript we have expanded the dataset construction section to report quantitative statistics on the synthetic data, including PII frequency per turn, average context and turn lengths, and simple linguistic measures of naturalness. We have also added an explicit limitations paragraph noting that direct distributional comparisons to proprietary real-world SFT corpora are not feasible and that the reported trends should be interpreted in light of the synthetic construction process. These changes provide concrete information about our datasets while honestly acknowledging the absence of external validation. revision: partial

  2. Referee: [Results and Evaluation] The abstract and results sections assert that COVA 'consistently outperforming existing extraction methods' and that 'partial attacker knowledge can significantly improve reconstruction success,' yet no specific success rates, baseline implementations, error bars, or statistical tests are referenced. Without these, the magnitude and reliability of the claimed advantages cannot be assessed.

    Authors: The results section already contains the requested details: concrete success rates for COVA versus the implemented baselines (greedy decoding and beam search), error bars computed over repeated runs, and descriptions of the baseline configurations. To improve accessibility we have revised the abstract to summarize the key quantitative improvements and added explicit cross-references to the relevant tables and figures. We have further added paired statistical significance tests in the revised results section to substantiate the claims of consistent outperformance and the effect of partial attacker knowledge. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical evaluation without derivations or self-referential reductions

full rationale

The paper is a purely empirical study proposing the COVA decoding algorithm and measuring PII leakage on author-constructed multi-turn Q&A datasets in medical and legal domains. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text or abstract. The central results (COVA outperforming baselines, effect of partial attacker knowledge, variation across PII types) are obtained by direct experimentation rather than any reduction to inputs by construction. Dataset construction is a standard methodological choice for privacy evaluation and does not create the self-definitional or fitted-input circularity patterns enumerated in the guidelines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; the work appears to rest on standard empirical ML assumptions about dataset realism and attack models.

pith-pipeline@v0.9.0 · 5485 in / 1047 out tokens · 87675 ms · 2026-05-13T04:06:41.535559+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 1 internal anchor

  1. [1]

    Ai4Privacy. 2024. Ai4Privacy Hugging Face Repository. https://huggingface.co/ ai4privacy. Accessed: 2026-04-28

  2. [2]

    Atilla Akkus, Masoud Poorghaffar Aghdam, Mingjie Li, Junjie Chu, Michael Backes, Yuyang Zhang, and Sinem Sav. 2025. Generated data with fake privacy: 12 Hidden dangers of fine-tuning large language models on generated data. In34th USENIX Security Symposium (USENIX Security 25). 8075–8093

  3. [3]

    Meenatchi Sundaram Muthu Selva Annamalai, Emiliano De Cristofaro, and Peter Kairouz. 2026. CLIOPATRA: Extracting Private Information from LLM Insights. arXiv preprint arXiv:2603.09781(2026)

  4. [4]

    Anthropic. 2024. The Claude 3 Model Family: Opus, Sonnet, Haiku. Anthropic Model Card(2024). https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  5. [5]

    Teodora Baluta, Shiqi Shen, S Hitarth, Shruti Tople, and Prateek Saxena. 2022. Membership inference attacks and generalization: A causal perspective. InPro- ceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security. 249–262

  6. [6]

    Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veličković, Ilia Shumailov, and Jamie Hayes

    Federico Barbero, Xiangming Gu, Christopher A. Choquette-Choo, Chawin Sitawarin, Matthew Jagielski, Itay Yona, Petar Veličković, Ilia Shumailov, and Jamie Hayes. 2025. Extracting alignment data in open models. arXiv:2510.18554 [cs.AI] https://arxiv.org/abs/2510.18554

  7. [7]

    Smith, and Christopher A

    Jaydeep Borkar, Matthew Jagielski, Katherine Lee, Niloofar Mireshghallah, David A. Smith, and Christopher A. Choquette-Choo. 2025. Privacy Ripple Effects from Adding or Removing Personal Information in Language Model Training. InFindings of the Association for Computational Linguistics: ACL 2025, Wanxi- ang Che, Joyce Nabende, Ekaterina Shutova, and Moham...

  8. [8]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners.Advances in neural information processing systems33 (2020), 1877–1901

  9. [9]

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramèr. 2022. Membership Inference Attacks From First Principles. In 2022 IEEE Symposium on Security and Privacy (SP). 1897–1914. doi:10.1109/ SP46214.2022.9833649

  10. [10]

    Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2022. Quantifying memorization across neural language models. InThe Eleventh International Conference on Learning Represen- tations

  11. [11]

    Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert- Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. 2021. Extracting training data from large language models. In30th USENIX security symposium (USENIX Security 21). 2633–2650

  12. [12]

    Ilias Chalkidis, Ion Androutsopoulos, and Nikolaos Aletras. 2019. Neural Legal Judgment Prediction in English. InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Anna Korhonen, David Traum, and Lluís Màrquez (Eds.). Association for Computational Linguistics, Florence, Italy, 4317–4323. doi:10.18653/v1/P19-1424

  13. [13]

    Lichang Chen, Shiyang Li, Jun Yan, Hai Wang, Kalpa Gunaratna, Vikas Yadav, Zheng Tang, Vijay Srinivasan, Tianyi Zhou, Heng Huang, et al. 2023. Alpagasus: Training a better alpaca with fewer data.arXiv preprint arXiv:2307.08701(2023)

  14. [14]

    Xiaoyi Chen, Siyuan Tang, Rui Zhu, Shijun Yan, Lei Jin, Zihao Wang, Liya Su, Zhikun Zhang, XiaoFeng Wang, and Haixu Tang. 2024. The janus interface: How fine-tuning in large language models amplifies the privacy risks. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security. 1285–1299

  15. [15]

    Shuai Cheng, Zhao Li, Shu Meng, Mengxia Ren, Haitao Xu, Shuai Hao, Chuan Yue, and Fan Zhang. 2025. Understanding PII Leakage in Large Language Models: A Systematic Survey. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, James Kwok (Ed.). Interna- tional Joint Conferences on Artificial Intelligence Or...

  16. [16]

    Shuai Cheng, Shu Meng, Haitao Xu, Haoran Zhang, Shuai Hao, Chuan Yue, Wen- rui Ma, Meng Han, Fan Zhang, and Zhao Li. 2025. Effective{PII} Extraction from {LLMs} through Augmented {Few-Shot} Learning. In34th USENIX Security Symposium (USENIX Security 25). 8155–8173

  17. [17]

    Jiaxi Cui, Zongjian Li, Yang Yan, Bohua Chen, and Li Yuan. 2023. Chatlaw: Open-source legal large language model with integrated external knowledge bases.arXiv preprint arXiv:2306.160922 (2023)

  18. [18]

    Guanting Dong, Hongyi Yuan, Keming Lu, Chengpeng Li, Mingfeng Xue, Dayi- heng Liu, Wei Wang, Zheng Yuan, Chang Zhou, and Jingren Zhou. 2024. How Abilities in Large Language Models are Affected by Supervised Fine-tuning Data Composition. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Lun-W...

  19. [19]

    Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, and Tao Jiang

  20. [20]

    InThe Thirty-eighth Annual Conference on Neural Information Processing Systems

    Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=PAWQvrForJ

  21. [21]

    Filippo Galli, Luca Melis, and Tommaso Cucinotta. 2024. Noisy Neighbors: Efficient membership inference attacks against LLMs. InProceedings of the Fifth Workshop on Privacy in Natural Language Processing, Ivan Habernal, Sepideh Ghanavati, Abhilasha Ravichander, Vijayanta Jain, Patricia Thaine, Timour Igamberdiev, Niloofar Mireshghallah, and Oluwaseyi Feyi...

  22. [22]

    Gigasheet. 2026. Free List of Law Practice Businesses (CSV). https://www. gigasheet.com/sample-data/free-list-of-law-practice-businessescsv. Accessed: 2026-04-29

  23. [23]

    Eric Goldman. 2020. An introduction to the california consumer privacy act (ccpa).Santa Clara Univ. Legal Studies Research Paper(2020)

  24. [24]

    Jamie Hayes, Ilia Shumailov, Christopher A. Choquette-Choo, Matthew Jagielski, Georgios Kaissis, Milad Nasr, Meenatchi Sundaram Muthu Selva Annamalai, Niloofar Mireshghallah, Igor Shilov, Matthieu Meeus, Yves-Alexandre de Mon- tjoye, Katherine Lee, Franziska Boenisch, Adam Dziedzic, and A. Feder Cooper

  25. [25]

    InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

    Exploring the limits of strong membership inference attacks on large language models. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id=x0i7wvRLHK

  26. [26]

    Health Resources and Services Administration. 2026. HRSA Data Warehouse: Health Center Service Delivery Site Data. https://data.hrsa.gov/data/download. Accessed: 2026-04-29

  27. [27]

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-Rank Adaptation of Large Language Models. InInternational Conference on Learning Representations. https://openreview.net/forum?id=nZeVKeeFYf9

  28. [28]

    Zhiqiang Hu, Lei Wang, Yihuai Lan, Wanyu Xu, Ee-Peng Lim, Lidong Bing, Xing Xu, Soujanya Poria, and Roy Lee. 2023. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models. InProceedings of the 2023 conference on empirical methods in natural language processing. 5254–5276

  29. [29]

    Anthony Hughes, Vasisht Duddu, N Asokan, Nikolaos Aletras, and Ning Ma

  30. [30]

    InFindings of the Association for Computational Linguistics: EACL 2026

    PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing. InFindings of the Association for Computational Linguistics: EACL 2026. 5139–5153

  31. [31]

    Hyejun Jeong, Shiqing Ma, and Amir Houmansadr. 2026. Bias Similarity Mea- surement: A Black-Box Audit of Fairness Across LLMs. InThe Fourteenth Inter- national Conference on Learning Representations. https://openreview.net/forum? id=EveruzAsGI

  32. [32]

    Seongho Keum, Dongwon Shin, Leo Marchyok, Sanghyun Hong, and Sooel Son

  33. [33]

    In34th USENIX Security Symposium (USENIX Security 25)

    Private Investigator: Extracting Personally Identifiable Information from Large Language Models Using Optimized Prompts. In34th USENIX Security Symposium (USENIX Security 25). 8175–8194

  34. [34]

    Bryan Klimt and Yiming Yang. 2004. The enron corpus: A new dataset for email classification research. InEuropean conference on machine learning. Springer, 217–226

  35. [35]

    Simon Lermen, Daniel Paleka, Joshua Swanson, Michael Aerni, Nicholas Carlini, and Florian Tramèr. 2026. Large-scale online deanonymization with LLMs.arXiv preprint arXiv:2602.16800(2026)

  36. [36]

    Jonathan Li, Rohan Bhambhoria, and Xiaodan Zhu. 2022. Parameter-efficient legal domain adaptation. InProceedings of the Natural Legal Language Processing Workshop 2022. 119–129

  37. [37]

    Zongjie Li, Daoyuan Wu, Shuai Wang, and Zhendong Su. 2025. Differentiation- based extraction of proprietary data from fine-tuned llms. InProceedings of the 2025 ACM SIGSAC Conference on Computer and Communications Security. 3071–3085

  38. [38]

    Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and San- tiago Zanella-Béguelin. 2023. Analyzing leakage of personally identifiable in- formation in language models. In2023 IEEE Symposium on Security and Privacy (SP). IEEE, 346–363

  39. [39]

    Mexwell. 2024. US Hospitals Dataset. https://www.kaggle.com/datasets/mexwell/ us-hospitals-dataset. Accessed: 2026-04-29

  40. [40]

    Niloofar Mireshghallah, Maria Antoniak, Yash More, Yejin Choi, and Golnoosh Farnadi. 2024. Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild. InFirst Conference on Language Modeling. https: //openreview.net/forum?id=tIpWtMYkzU

  41. [41]

    Niloofar Mireshghallah and Tianshi Li. 2025. Position: Privacy Is Not Just Mem- orization!arXiv preprint arXiv:2510.01645(2025)

  42. [42]

    Feder Cooper, Daphne Ippolito, Christopher A

    Milad Nasr, Javier Rando, Nicholas Carlini, Jonathan Hayase, Matthew Jagielski, A. Feder Cooper, Daphne Ippolito, Christopher A. Choquette-Choo, Florian Tramèr, and Katherine Lee. 2025. Scalable Extraction of Training Data from Aligned, Production Language Models. InThe Thirteenth International Conference on Learning Representations. https://openreview.ne...

  43. [43]

    Vincent Nguyen, Sarvnaz Karimi, Maciej Rybinski, and Zhenchang Xing. 2023. MedRedQA for medical consumer question answering: Dataset, tasks, and neural baselines. InProceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volum...

  44. [44]

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al

  45. [45]

    In 13 Advances in Neural Information Processing Systems, Vol

    Training language models to follow instructions with human feedback. In 13 Advances in Neural Information Processing Systems, Vol. 35. 27730–27744

  46. [46]

    Parelo Software. 2026. United States Cities Database. https://simplemaps.com/ data/us-cities. Accessed: 2026-04-29

  47. [47]

    Renjie Pi, Tianyang Han, Jianshu Zhang, Yueqi Xie, Rui Pan, Qing Lian, Hanze Dong, Jipeng Zhang, and Tong Zhang. 2024. MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (Eds.). Association for Comput...

  48. [48]

    Protection Regulation. 2016. Regulation (EU) 2016/679 of the European Parlia- ment and of the Council.Regulation (eu)679, 2016 (2016), 10–3

  49. [49]

    Weiyan Shi, Aiqi Cui, Evan Li, Ruoxi Jia, and Zhou Yu. 2022. Selective differential privacy for language modeling. InProceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2848–2859

  50. [50]

    Devansh Singh and Sundaraparipurnan Narayanan. 2025. Unmasking the reality of pii masking models: Performance gaps and the call for accountability.arXiv preprint arXiv:2504.12308(2025)

  51. [51]

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al

  52. [52]

    Large language models encode clinical knowledge.Nature620, 7972 (2023), 172–180

  53. [53]

    Marton Szep, Jorge Marin Ruiz, Georgios Kaissis, Paulina Seidl, Rüdiger von Eisenhart-Rothe, Florian Hinterwimmer, and Daniel Rueckert. 2026. Unintended Memorization of Sensitive Information in Fine-Tuned Language Models. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers),...

  54. [54]

    Xinyu Tang, Saeed Mahloujifar, Liwei Song, Virat Shejwalkar, Milad Nasr, Amir Houmansadr, and Prateek Mittal. 2022. Mitigating membership inference attacks by {Self-Distillation} through a novel ensemble architecture. In31st USENIX security symposium (USENIX security 22). 1433–1450

  55. [55]

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. 2023. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805(2023)

  56. [56]

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions. InProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers). 13484– 13508

  57. [57]

    Dai, and Quoc V Le

    Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned Language Models are Zero-Shot Learners. InInternational Conference on Learning Representations. https://openreview.net/forum?id=gEZrGCozdqR

  58. [58]

    Yuxiang Wei, Zhe Wang, Jiawei Liu, Yifeng Ding, and Lingming Zhang. 2023. Magicoder: Empowering code generation with oss-instruct.arXiv preprint arXiv:2312.02120(2023)

  59. [59]

    Yueqi Xie, Jingwei Yi, Jiawei Shao, Justin Curl, Lingjuan Lyu, Qifeng Chen, Xing Xie, and Fangzhao Wu. 2023. Defending chatgpt against jailbreak attack via self-reminders.Nature Machine Intelligence5, 12 (2023), 1486–1496

  60. [60]

    Xiaoyong Yuan and Lan Zhang. 2022. Membership inference attacks and defenses in neural network pruning. In31st USENIX Security Symposium (USENIX Security 22). 4561–4578

  61. [61]

    Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. 2024. MAmmoTH: Building Math Generalist Models through Hybrid Instruction Tuning. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=yLClGs770I

  62. [62]

    Shenglai Zeng, Yaxin Li, Jie Ren, Yiding Liu, Han Xu, Pengfei He, Yue Xing, Shuaiqiang Wang, Jiliang Tang, and Dawei Yin. 2023. Exploring memorization in fine-tuned language models.arXiv preprint arXiv:2310.06714(2023)

  63. [63]

    Liyi Zhang, Veniamin Veselovsky, R Thomas McCoy, and Thomas L Griffiths

  64. [64]

    Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models.arXiv preprint arXiv:2504.12585(2025)

  65. [65]

    john.smith@example.com

    Jian-Qiao Zhu and Thomas L Griffiths. 2024. Eliciting the priors of large language models using iterated in-context learning.arXiv preprint arXiv:2406.01860(2024). A Ethical Considerations Our research ensures compliance with ethical principles, including beneficence, respect for persons, justice, and adherence to legal and public interest. This study hig...

  66. [66]

    **Children’s Hospitals & Clinics OF MN**: Located at 2525 Chicago Avenue South, Minneapolis, MN 55404, with emergency services available

  67. [67]

    **Abbott Northwestern Hospital**: At 800 East 28TH Street, Minneapolis, MN 55407, this hospital has emergency services and is rated 4.0

  68. [68]

    **Hennepin County Medical Center 1**: Found at 701 Park Avenue, Minneapolis, MN 55415, this medical center has emergency services and is rated 2.0. For more information about services, hours, or to confirm availability, it’s best to call them directly or visit their websites.<|im_end|> C.4 Dataset Statistics After dataset processing, persona has an averag...