DeIDClinic: A Risk-Aware Pseudonymization Framework for Clinical Text De-identification and Re-identification Risk Assessment

Angel Paul; Dhivin Shaji; Goran Nenadic; Lifeng Han; Suzan Verberne; Warren Del-Pinto

arxiv: 2410.01648 · v2 · pith:HPX5BTB4new · submitted 2024-10-02 · 💻 cs.CL


classification 💻 cs.CL

keywords: de-identification, risk, assessment, data, framework, pseudonymization, clinical, entity


The increasing availability of sensitive textual data has created an urgent need for robust de-identification methods that enable compliant data sharing while preserving downstream utility. This paper presents DeIDClinic, a multi-layered framework for automated pseudonymization and re-identification risk assessment of clinical free-text data. Our approach integrates domain-adapted transformer models, including BioBERT and ClinicalBERT, into the MASK de-identification framework to improve the detection and masking of protected health information (PHI). Beyond entity recognition, we introduce a novel document-level risk assessment module that quantifies residual re-identification risk using a combination of k-anonymity, l-diversity, t-closeness, contextual similarity, and entity co-occurrence analysis. Experiments conducted on the i2b2 2014 de-identification dataset demonstrate strong performance, achieving macro-level F1 scores above 0.96 for several entity categories, while enabling quantitative prioritization of high-risk documents for further review. Our results highlight the effectiveness of combining neural de-identification with explicit risk modeling, supporting privacy-preserving data sharing in sensitive domains. Although evaluated on clinical text, the proposed framework is generalizable to other privacy-critical domains such as legal and administrative documents, where reliable pseudonymization and risk-aware anonymization are essential.

Keywords: Automated De-Identification, Risk Assessment, Patient Privacy, Pseudonymization, Personal Health Information
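The abstract names k-anonymity, l-diversity, and t-closeness as ingredients of the risk module but does not say how they are computed. The following is a minimal sketch of the textbook definitions, not the authors' implementation. The records, the field names (`age_band`, `region`, `diagnosis`), and the helper functions are all hypothetical, and t-closeness uses the equal-ground-distance variant for categorical values, in which the Earth Mover's Distance reduces to total variation distance.

```python
from collections import Counter, defaultdict

# Hypothetical records, one per document: quasi-identifiers that survive
# masking plus one sensitive attribute. All values are invented.
records = [
    {"age_band": "60-69", "region": "North", "diagnosis": "diabetes"},
    {"age_band": "60-69", "region": "North", "diagnosis": "asthma"},
    {"age_band": "60-69", "region": "North", "diagnosis": "diabetes"},
    {"age_band": "30-39", "region": "South", "diagnosis": "asthma"},
]
QUASI_IDENTIFIERS = ("age_band", "region")
SENSITIVE = "diagnosis"


def equivalence_classes(rows, quasi_identifiers):
    """Group rows that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[q] for q in quasi_identifiers)].append(row)
    return dict(groups)


def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest equivalence class."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len(group) for group in groups.values())


def l_diversity(rows, quasi_identifiers, sensitive):
    """Fewest distinct sensitive values found in any equivalence class."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


def distribution(rows, sensitive):
    """Relative frequency of each sensitive value in rows."""
    counts = Counter(r[sensitive] for r in rows)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def t_closeness(rows, quasi_identifiers, sensitive):
    """Largest total variation distance between a class and the whole table."""
    overall = distribution(rows, sensitive)
    worst = 0.0
    for group in equivalence_classes(rows, quasi_identifiers).values():
        local = distribution(group, sensitive)
        distance = 0.5 * sum(abs(local.get(v, 0.0) - p) for v, p in overall.items())
        worst = max(worst, distance)
    return worst


def risky_documents(rows, quasi_identifiers, k):
    """Indices of rows whose equivalence class has fewer than k members."""
    groups = equivalence_classes(rows, quasi_identifiers)
    return [
        i for i, row in enumerate(rows)
        if len(groups[tuple(row[q] for q in quasi_identifiers)]) < k
    ]


print("k =", k_anonymity(records, QUASI_IDENTIFIERS))
print("l =", l_diversity(records, QUASI_IDENTIFIERS, SENSITIVE))
print("t =", t_closeness(records, QUASI_IDENTIFIERS, SENSITIVE))
print("review first:", risky_documents(records, QUASI_IDENTIFIERS, k=2))
```

In this toy table the single "South" record forms an equivalence class of size one, so k is 1 and that record is the one flagged for review. Ranking documents by scores of this kind is one plausible reading of the "quantitative prioritization of high-risk documents" described above. The contextual similarity and entity co-occurrence components are not sketched here.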


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Selective Token-Level Cryptographic Redaction for Privacy-Preserving Clinical Deployment of Large Language Models
cs.CL 2026-06 unverdicted novelty 4.0

HERALD selectively encrypts sensitive tokens via medical NER, POS policies, and deterministic ciphertext substitution to enable privacy-preserving clinical LLM use while recovering near-plaintext task performance.