Applying deep learning techniques on medical corpora from the World Wide Web: a prototypical system and evaluation

Jose Antonio Miñarro-Giménez; Matthias Samwald; Oscar Marín-Alonso

arxiv: 1502.03682 · v1 · pith:G2S7VGLDnew · submitted 2015-02-12 · 💻 cs.CL · cs.IR · cs.LG · cs.NE

keywords word2vec · corpora · medical · relationships · results · accuracy · knowledge · ability

BACKGROUND: The amount of biomedical literature is growing rapidly, and it is becoming increasingly difficult to keep manually curated knowledge bases and ontologies up to date. In this study we applied the word2vec deep learning toolkit to medical corpora to test its potential for identifying relationships from unstructured text. We evaluated the efficiency of word2vec in identifying properties of pharmaceuticals based on mid-sized, unstructured medical text corpora available on the web. Properties included relationships to diseases ('may treat') or physiological processes ('has physiological effect'). We compared the relationships identified by word2vec with manually curated information from the National Drug File - Reference Terminology (NDF-RT) ontology as a gold standard.

RESULTS: Our results revealed a maximum accuracy of 49.28%, which suggests a limited ability of word2vec to capture linguistic regularities on the collected medical corpora compared with other published results. We were able to document the influence of different parameter settings on result accuracy and found an unexpected trade-off between ranking quality and accuracy. Pre-processing corpora to reduce syntactic variability proved to be a good strategy for increasing the utility of the trained vector models.

CONCLUSIONS: Word2vec is a very efficient implementation for computing vector representations and is able to identify relationships in textual data without any prior domain knowledge. We found that the ranking and retrieved results generated by word2vec were not of sufficient quality for automatic population of knowledge bases and ontologies, but could serve as a starting point for further manual curation.
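To make the evaluation idea in the abstract concrete, the following is a minimal sketch of how relationships retrieved from word vectors might be scored against a gold standard. It is not the authors' code. The vector-offset query, the toy three-dimensional vectors, the gold pairs, and the names `cosine`, `analogy_candidates`, `accuracy`, `TOY_VECTORS` and `TOY_GOLD` are all assumptions made for illustration; the abstract does not say how queries were formed or how accuracy was computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def analogy_candidates(vectors, a, b, c, top_k=1):
    """Rank terms closest to v(b) - v(a) + v(c), excluding the query terms.

    Assumption: a vector-offset query of the form "a is to b as c is to ?".
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    scored = [(cosine(target, vec), term)
              for term, vec in vectors.items()
              if term not in (a, b, c)]
    scored.sort(reverse=True)
    return [term for _, term in scored[:top_k]]


def accuracy(vectors, seed, gold, top_k=1):
    """Fraction of gold pairs whose expected term is among the top_k candidates."""
    seed_drug, seed_target = seed
    hits = 0
    for drug, expected in gold:
        found = analogy_candidates(vectors, seed_drug, seed_target, drug, top_k)
        if expected in found:
            hits += 1
    return hits / len(gold) if gold else 0.0


# Hypothetical hand-made vectors, NOT trained on any corpus.
TOY_VECTORS = {
    "aspirin": [1.0, 0.0, 0.2],
    "pain": [1.0, 1.0, 0.2],
    "insulin": [0.0, 0.0, 1.0],
    "diabetes": [0.0, 1.0, 1.0],
    "metformin": [0.1, 0.0, 0.9],
    "fever": [0.9, 1.0, 0.0],
}

# Hypothetical stand-in for gold-standard 'may treat' pairs (drug, disease).
TOY_GOLD = [("insulin", "diabetes"), ("metformin", "diabetes")]

if __name__ == "__main__":
    score = accuracy(TOY_VECTORS, ("aspirin", "pain"), TOY_GOLD, top_k=1)
    print(f"top-1 accuracy on toy data: {score:.2%}")
```

In this sketch, raising `top_k` counts a gold pair as found if the expected term appears anywhere among the first `top_k` candidates, so a hit may sit at a lower rank. Whether this corresponds to the trade-off between ranking quality and accuracy reported in the abstract cannot be determined from the abstract alone.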

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Snomed2Vec: Random Walk and Poincaré Embeddings of a Clinical Knowledge Base for Healthcare Analytics
cs.LG 2019-07 unverdicted novelty 4.0

SNOMED-CT graph embeddings via random walks and Poincaré methods yield 5-6x better concept similarity and 6-20% better patient diagnosis prediction than prior embeddings.