Validating transformers for redaction of text from electronic health records in real-world healthcare

Anoop D. Shah; Anthony Shek; Ewart Jonathan Sheldon; Haris Shuaib; James Teo; Joshua Au Yeung; Kawsar Noor; Mohammad Al-Agil; Richard Dobson; Xi Bai

arxiv: 2310.04468 · v1 · pith:5RMKWPZVnew · submitted 2023-10-05 · 💻 cs.CL · cs.AI


Zeljko Kraljevic, Anthony Shek, Joshua Au Yeung, Ewart Jonathan Sheldon, Mohammad Al-Agil, Haris Shuaib, Xi Bai, Kawsar Noor, Anoop D. Shah, Richard Dobson, James Teo



keywords: real-world, healthcare, health, hospitals, redaction, text, algorithms, anoncat



Protecting patient privacy in healthcare records is a top priority, and redaction is a commonly used method for obscuring directly identifiable information in text. Rule-based methods have been widely used, but their precision is often low, causing over-redaction of text, and they are frequently not adaptable enough for non-standardised or unconventional structures of personal health information. Deep learning techniques have emerged as a promising solution, but implementing them in real-world environments poses challenges due to differences in patient record structure and language across departments, hospitals, and countries. In this study, we present AnonCAT, a transformer-based model, and a blueprint for how de-identification models can be deployed in real-world healthcare. AnonCAT was trained through a process involving manually annotated redactions of 3116 real-world documents from three UK hospitals with different electronic health record systems. The model achieved high performance in all three hospitals, with a Recall of 0.99, 0.99 and 0.96. Our findings demonstrate the potential of deep learning techniques for improving the efficiency and accuracy of redaction in global healthcare data, and highlight the importance of building workflows that not only use these models but can also continually fine-tune and audit their performance to ensure continuing effectiveness in real-world settings. This approach provides a blueprint for the real-world use of de-identifying algorithms through fine-tuning and localisation. The code, together with tutorials, is available on GitHub (https://github.com/CogStack/MedCAT).
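The abstract reports Recall as its headline metric and notes that low precision leads to over-redaction. As a rough illustration of what those two numbers mean for redaction, the sketch below scores predicted redaction spans against manually annotated ones using exact span matching. This is an assumption made for illustration only: the scoring scheme, the label names, the example sentence, and the helper functions `recall_precision` and `redact` are all hypothetical and are not taken from the paper or from the MedCAT codebase.

```python
from typing import List, Tuple

# A span is (start, end, label) with an exclusive end offset.
# Labels such as "NAME" are illustrative, not the paper's annotation scheme.
Span = Tuple[int, int, str]


def span_of(text: str, phrase: str, label: str) -> Span:
    """Locate the first occurrence of `phrase` in `text` and return it as a span."""
    start = text.index(phrase)
    return (start, start + len(phrase), label)


def recall_precision(gold: List[Span], pred: List[Span]) -> Tuple[float, float]:
    """Exact-match recall and precision over spans (an assumed scoring scheme)."""
    gold_set, pred_set = set(gold), set(pred)
    true_positives = len(gold_set & pred_set)
    recall = true_positives / len(gold_set) if gold_set else 1.0
    precision = true_positives / len(pred_set) if pred_set else 1.0
    return recall, precision


def redact(text: str, spans: List[Span], mask: str = "[REDACTED]") -> str:
    """Replace each span with a mask, working right to left so offsets stay valid."""
    for start, end, _ in sorted(spans, reverse=True):
        text = text[:start] + mask + text[end:]
    return text


# Synthetic example sentence, not real patient data.
text = "John Smith was seen at St Thomas on 01/02/2020."
gold = [
    span_of(text, "John Smith", "NAME"),
    span_of(text, "St Thomas", "HOSPITAL"),
    span_of(text, "01/02/2020", "DATE"),
]
# A hypothetical model output that misses the hospital name.
pred = [gold[0], gold[2]]

recall, precision = recall_precision(gold, pred)
print(f"recall={recall:.2f} precision={precision:.2f}")
print(redact(text, pred))
```

In this toy case the missed hospital name lowers recall while precision stays perfect, which is why a privacy-focused evaluation tends to emphasise recall: a missed identifier leaks information, whereas a spurious redaction only removes extra text.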



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification
cs.CL 2026-05 unverdicted novelty 6.0

SHIELD is a new diverse clinical note dataset paired with distilled small language models that achieve 0.89 span-level precision and 0.88 recall for on-premise PHI de-identification.

SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification
cs.CL 2026-05 conditional novelty 5.0

SHIELD dataset and distilled DeBERTa v3 model achieve 0.88 micro precision and 0.86 recall on PHI de-identification while matching teacher performance on structured categories.