The secondary use of Electronic Health Records (EHRs) has transformative potential for healthcare research, opening new possibilities in business intelligence, observational research, clinical trial recruitment and decision support.
However, as much as 80% of the data in the EHR is locked in the form of unstructured text, making this information 'invisible' to standard analysis techniques.
This two-day course will provide an introduction to the field of clinical natural language processing (NLP), from its origins before the advent of 'Big Data' to the current state of the art, in which information extraction algorithms process millions of documents on supercomputer hardware.
In addition, this course will convey an appreciation of the complexity that different NLP problems pose, via a series of talks and practical sessions.
For individuals wishing to participate fully in practical sessions, some basic programming experience, ideally with Java or Python, is recommended.
Planned Timetable
Day 1
Time | Session Title | Lead Tutor
09:30-10:00 | Introduction | Angus Roberts
10:00-11:00 | Practical session: Introducing GATE Developer | Angus Roberts
11:00-11:15 | Coffee |
11:15-11:45 | Practical session: Information Extraction with ANNIE | Angus Roberts
11:45-12:15 | Group discussion: Issues when building NLP IE applications | Angus Roberts
12:15-12:45 | Practical session: Simple information extraction with pattern matching | Angus Roberts
12:45-13:45 | Lunch |
13:45-15:15 | Practical session (continued): Simple information extraction with pattern matching | Angus Roberts
15:15-15:30 | Coffee |
15:30-17:00 | Practical session: A medications example using pattern matching | Angus Roberts
17:00 | Close |
Day 2
Time | Session Title | Lead Tutor
09:00-10:30 | Practical session (continued): A medications example using pattern matching | Angus Roberts
10:30-10:45 | Coffee |
10:45-11:30 | Group discussion: Issues when building NLP IE applications - validation | Angus Roberts
11:30-12:15 | Machine Learning for NLP: Introduction | Angus Roberts
12:15-13:15 | Lunch |
13:15-14:45 | Practical session: Supervised Machine Learning - classification | Angus Roberts
14:45-15:15 | Coffee |
15:15-17:00 | Practical session: Supervised Machine Learning - chunking | Angus Roberts
17:00 | Close |
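To give a flavour of the pattern-matching practicals, here is a minimal sketch of rule-based medication extraction. This is an illustrative toy in Python using plain regular expressions, not the GATE/JAPE approach the sessions actually use; the drug lexicon and example sentence are invented for the sketch.

```python
import re

# Toy lexicon of drug names and a dose pattern (illustrative only;
# the course practicals use GATE rather than raw regexes).
DRUGS = ["aspirin", "metformin", "warfarin", "simvastatin"]
DOSE = r"\d+(?:\.\d+)?\s?(?:mg|mcg|g|ml)"

# Match "<drug> <dose>" or a bare "<drug>" mention in free text.
pattern = re.compile(
    r"\b(?P<drug>" + "|".join(DRUGS) + r")\b(?:\s(?P<dose>" + DOSE + r"))?",
    re.IGNORECASE,
)

def extract_medications(text):
    """Return (drug, dose) pairs found in a clinical note; dose may be None."""
    return [(m.group("drug").lower(), m.group("dose"))
            for m in pattern.finditer(text)]

note = "Patient started on Aspirin 75 mg daily; continues metformin."
print(extract_medications(note))
# [('aspirin', '75 mg'), ('metformin', None)]
```

Even this toy shows the issues the group discussions cover: lexicon coverage, spelling variation and negation all break simple patterns quickly.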
Course Team
Angus is a Senior Research Fellow in The University of Sheffield's Natural Language Processing group in the Department of Computer Science. His main research interests are: extraction of meaning from biomedical texts, such as medical records and medical research papers; and text mining software and infrastructure.
He is a member of the GATE team, for which he leads life science related work. GATE is a widely used software platform and framework for large-scale text mining and language engineering. It is used in the life sciences by medical record software companies, pharmaceutical companies, genetics researchers, and many others.
Angus originally trained and worked as a Biomedical Scientist, before working as a software developer and development manager, mainly in the UK National Health Service. This led to an interest in medical terminologies, ontologies, and the language of medical text.
In general, we know that a significant proportion of the world's data is in an unstructured format: news articles, tweets, blogs and so on. Some say 80% of our data is unstructured, while others estimate even more. Unsurprisingly, the same phenomenon is observed in healthcare, for example in the electronic health records held by hospitals.
https://physionet.org/content/mimiciii/1.4/
For example, the widely used clinical dataset MIMIC contains 11 years of data from two US intensive care units, covering nearly 60,000 ICU admissions, with an average of more than 34 clinical notes per admission. These notes contain very important information that is never recorded in MIMIC's structured tables, including the drugs patients were taking at admission and their past medical history, such as whether they had previously had a stroke or heart attack.
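The "notes per admission" statistic is simply a count of note rows grouped by admission. A minimal sketch, using synthetic rows that assume the layout of MIMIC-III's NOTEEVENTS table (each note row linked to an admission via HADM_ID):

```python
from collections import Counter

# Synthetic rows mimicking MIMIC-III's NOTEEVENTS table, in which each
# clinical note is linked to a hospital admission by HADM_ID.
notes = [
    {"HADM_ID": 100001, "CATEGORY": "Discharge summary"},
    {"HADM_ID": 100001, "CATEGORY": "Nursing"},
    {"HADM_ID": 100001, "CATEGORY": "Radiology"},
    {"HADM_ID": 100002, "CATEGORY": "Nursing"},
]

# Count notes per admission, then average; the "more than 34 notes per
# admission" figure above is this statistic over the full dataset.
per_admission = Counter(row["HADM_ID"] for row in notes)
mean_notes = sum(per_admission.values()) / len(per_admission)
print(mean_notes)  # 2.0 on this toy sample
```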
As another example, in 2018 H. Kharrazi and colleagues published a paper studying the value of free-text hospital data in identifying conditions that are prevalent in older adults, such as falls and dementia. They produced a very informative chart, reproduced here, which delivers a strong message about how useful and critical free-text data is.
Reproduced from: J Am Geriatr Soc. 2018 Aug;66(8):1499-1507. doi: 10.1111/jgs.15411
In each of the Venn diagrams, the top right circle represents claims data (red), the bottom right circle represents structured EHR data (blue), and the left circle represents unstructured free-text EHR data (green).
For every geriatric syndrome, the Venn diagrams show the significance of free-text data. In particular, for "lack of social support" and "malnutrition", the unstructured data accounts for more than 95% of the identified cases. Even for conditions such as dementia, for which one might expect good coded data, free-text data still identified many cases that are missing from the structured data.
40% of incident stroke cases ascertained via linked, structured data in the prospective, population-based UK Biobank cohort are coded as being of unspecified type.
In addition to identifying cases missed by structured data, free-text data is also very valuable for so-called "deep phenotyping": identifying more specific information about a patient's condition. Take stroke as an example. In a study published in July 2020, Kristiina and colleagues examined incident stroke cases in the Scottish portion of the UK Biobank cohort, which includes about 17,000 people. They found that 40% of the cases were coded as unspecified stroke in the structured data. There are three main subtypes of stroke, each with different causes and hence different treatments, so knowing only that a stroke occurred, without its subtype, is of limited use. Fortunately, a further study showed that free-text data such as radiology reports can identify the stroke subtype for all of these cases.
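The kind of subtyping described here can be sketched as a toy keyword classifier over radiology report text. The keywords and example reports below are illustrative assumptions, not the published study's method, which uses a far richer NLP pipeline:

```python
# Toy keyword rules for the three main stroke subtypes (illustrative
# only; real subtyping studies use much richer NLP over full reports).
SUBTYPE_KEYWORDS = {
    "ischaemic": ["infarct", "ischaemic", "ischemic", "occlusion"],
    "intracerebral haemorrhage": ["intracerebral haemorrhage",
                                  "intraparenchymal bleed"],
    "subarachnoid haemorrhage": ["subarachnoid"],
}

def classify_stroke_subtype(report):
    """Return the first subtype whose keywords appear, else 'unspecified'."""
    text = report.lower()
    for subtype, keywords in SUBTYPE_KEYWORDS.items():
        if any(k in text for k in keywords):
            return subtype
    return "unspecified"

print(classify_stroke_subtype(
    "CT head: acute infarct in the left MCA territory."))  # ischaemic
```

As with the medication example, negation ("no evidence of infarct") and hedged findings defeat naive keyword matching, which is why this is a research problem rather than a lookup.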
Summary
By now, I hope I have convinced you that free-text data is important, and in some disease-phenotyping scenarios critical.