Multi-domain Clinical Natural Language Processing with MedCAT: the Medical Concept Annotation Toolkit

Amos A Folarin; Angus Roberts; Anoop D Shah; Anthony Shek; Aurelie Mascio; Daniel Bean; James T Teo; Kawsar Noor; Leilei Zhu; Lukasz Roguski

arxiv: 2010.01165 · v2 · pith:Z54RX3MNnew · submitted 2020-10-02 · 💻 cs.CL · cs.AI · cs.LG


Zeljko Kraljevic, Thomas Searle, Anthony Shek, Lukasz Roguski, Kawsar Noor, Daniel Bean, Aurelie Mascio, Leilei Zhu, Amos A Folarin, Angus Roberts, Rebecca Bendayan, Mark P Richardson, Robert Stewart, Anoop D Shah, Wai Keong Wong, Zina Ibrahim, James T Teo, Richard JB Dobson



keywords: clinical, concept, annotation, concepts, datasets, extracting, extraction, further


Electronic health records (EHR) contain large volumes of unstructured text, requiring the application of Information Extraction (IE) technologies to enable clinical analysis. We present the open-source Medical Concept Annotation Toolkit (MedCAT) that provides: a) a novel self-supervised machine learning algorithm for extracting concepts using any concept vocabulary including UMLS/SNOMED-CT; b) a feature-rich annotation interface for customising and training IE models; and c) integrations to the broader CogStack ecosystem for vendor-agnostic health system deployment. We show improved performance in extracting UMLS concepts from open datasets (F1:0.448-0.738 vs 0.429-0.650). Further real-world validation demonstrates SNOMED-CT extraction at 3 large London hospitals with self-supervised training over ~8.8B words from ~17M clinical records and further fine-tuning with ~6K clinician annotated examples. We show strong transferability (F1 > 0.94) between hospitals, datasets, and concept types indicating cross-domain EHR-agnostic utility for accelerated clinical and research use cases.
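The abstract reports concept extraction quality as F1 scores. As a reminder of what that metric measures, here is a minimal sketch in Python, assuming annotations are compared by exact match on a hypothetical (start, end, concept_id) tuple. The tuple format, the function name `f1_score`, and the placeholder concept identifiers are illustrative assumptions; this is not MedCAT's evaluation code, and the paper may use a different matching criterion.

```python
from typing import Set, Tuple

# Hypothetical annotation format: (start offset, end offset, concept id).
Annotation = Tuple[int, int, str]


def f1_score(predicted: Set[Annotation], gold: Set[Annotation]) -> float:
    """Exact-match F1 between predicted and gold concept annotations."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        # Covers empty inputs and the no-overlap case without dividing by zero.
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Placeholder concept ids, not real UMLS or SNOMED-CT codes.
gold = {(0, 8, "CUI_A"), (20, 32, "CUI_B"), (40, 46, "CUI_C")}
predicted = {(0, 8, "CUI_A"), (20, 32, "CUI_B"), (50, 55, "CUI_D")}

score = f1_score(predicted, gold)
print(f"F1 = {score:.3f}")
```

In this toy case two of three predictions match two of three gold annotations, so precision and recall are both 2/3 and F1 equals 2/3.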
