What is free text in an EHR?

The secondary use of Electronic Health Records (EHRs) has transformative potential for how healthcare research is conducted, yielding new possibilities in areas of business intelligence, observational research, clinical trial recruitment and decision support.

However, as much as 80% of the data in the EHR are known to be locked in the form of unstructured text, making this information 'invisible' for standard analysis techniques.

This two-day course will provide an introduction to the field of clinical natural language processing (NLP), from its origins before the advent of 'Big Data', to the current state of the art, comprising information extraction algorithms processing millions of documents on supercomputer hardware.

In addition, this course will convey an appreciation of the complexity that different NLP problems pose, via a series of talks and practical sessions.

For individuals wishing to participate fully in practical sessions, some basic programming experience, ideally with Java or Python, is recommended.

Planned Timetable

Day 1

Time | Session Title | Lead Tutor
09:30-10:00 | Introduction | Angus Roberts
10:00-11:00 | Practical session: Introducing GATE Developer | Angus Roberts
 | Coffee |
11:15-11:45 | Practical session: Information Extraction with ANNIE | Angus Roberts
11:45-12:15 | Group discussion: Issues when building NLP IE applications | Angus Roberts
12:15-12:45 | Practical session: Simple information extraction with pattern matching | Angus Roberts
 | Lunch |
13:45-15:15 | continued. Practical session: Simple information extraction with pattern matching | Angus Roberts
15:15-15:30 | Coffee |
15:30-17:00 | Practical session: A medications example using pattern matching | Angus Roberts
17:00 | Close |

Day 2

Time | Session Title | Lead Tutor
09:00-10:30 | Practical Session continued: A medications example using pattern matching | Angus Roberts
10:30-10:45 | Coffee |
10:45-11:30 | Group discussion: Issues when building NLP IE applications - validation | Angus Roberts
11:30-12:15 | Machine Learning for NLP: Introduction | Angus Roberts
 | Lunch |
13:15-14:45 | Practical session: Supervised Machine Learning - classification | Angus Roberts
14:45-15:15 | Coffee |
15:15-17:00 | Practical session: Supervised Machine Learning - chunking | Angus Roberts
17:00 | Close |


Course Team

Angus is a Senior Research Fellow in The University of Sheffield's Natural Language Processing group in the Department of Computer Science. His main research interests are: extraction of meaning from biomedical texts, such as medical records and medical research papers; and text mining software and infrastructure.

He is a member of the GATE team, for which he leads life science related work. GATE is a widely used software platform and framework for large-scale text mining and language engineering. It is used in the life sciences by medical record software companies, pharmaceutical companies, genetics researchers, and many others.

Angus originally trained and worked as a Biomedical Scientist, before working as a software developer and development manager, mainly in the UK National Health Service. This led to an interest in medical terminologies, ontologies, and the language of medical text.

Alright, in general, we know a significant proportion of the world’s data is in an unstructured format, like news articles, Tweets and blogs. Some say 80% of our data is unstructured, while others estimate even more. Unsurprisingly, this phenomenon is also observed in health care, for example in electronic health records at hospitals.

For example, there is a widely used clinical dataset called MIMIC, which contains 11 years of data from 2 US-based intensive care units. It has nearly 60 thousand ICU admissions. On average, there are more than 34 clinical notes for each admission. These notes contain very important information that is never recorded in MIMIC’s structured database, including the drugs people were using at admission and patients’ past medical history, such as whether they had had a stroke or heart attack before.

As another example, in 2018, H. Kharrazi and colleagues published a paper that studied the value of free-text hospital data in identifying conditions that are prevalent in older adults, such as falls and dementia. They produced a very nice chart, shown here. To me, this is a great chart because it delivers a very strong message about how useful and critical free-text data is.

Reproduced from: J Am Geriatr Soc. 2018 Aug;66(8):1499-1507. doi: 10.1111/jgs.15411

In each of the Venn diagrams, the top right circle represents claims data (in red), the bottom right circle represents structured EHR data (in blue), and the left circle represents unstructured free-text EHR data (in green).

For every geriatric syndrome, the Venn diagram shows the significance of free-text data. In particular, for “lack of social support” and “malnutrition”, the unstructured data accounts for more than 95% of the cases. Even for something like “dementia”, for which one might expect good coded data, free-text data still helped to identify a lot of cases that are missing from the structured data.

40% of incident stroke cases ascertained via linked, structured data in the prospective, population-based UK Biobank cohort are coded as being of unspecified type.

In addition to helping identify cases missed by structured data, free-text data is also very valuable in so-called “deep phenotyping”, that is, identifying more specific information about a patient’s condition. Let’s use stroke as an example. In a study published in July 2020, Kristiina and colleagues studied incident stroke cases in the Scottish part of the UK Biobank cohort, which contains 17k people. They found that 40% of the cases were coded as unspecified stroke in the structured data. We know there are three main subtypes of stroke; they have different causes and hence need to be treated differently. Knowing only that a stroke is of unspecified type is clearly not very useful. Fortunately, a further study has shown that free-text data, such as radiology reports, can help identify the stroke subtypes for the cases in this 40%.
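To give a feel for what reading a subtype out of a radiology report could involve, here is a minimal sketch in Python. It is not the method used in the studies mentioned above. The subtype names, the keyword lists and the function `suggest_subtypes` are all assumptions made for illustration, and simple keyword matching like this ignores negation ("no evidence of infarct"), uncertainty and context, which is exactly why proper clinical NLP is needed.

```python
# Hypothetical keyword lists for three stroke subtypes (illustrative only).
SUBTYPE_KEYWORDS = {
    "ischaemic stroke": ["infarct", "ischaemic", "ischemic"],
    "intracerebral haemorrhage": [
        "intracerebral haemorrhage",
        "intracerebral hemorrhage",
        "intraparenchymal",
    ],
    "subarachnoid haemorrhage": ["subarachnoid"],
}


def suggest_subtypes(report):
    """Return the subtypes whose keywords appear in the report text.

    This is plain substring matching on lower-cased text: it does not
    handle negation, so it can only suggest candidates for review.
    """
    text = report.lower()
    return sorted(
        subtype
        for subtype, keywords in SUBTYPE_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    )


# Made-up report snippets, not taken from any real dataset.
reports = [
    "CT head: acute infarct in the left MCA territory.",
    "Diffuse subarachnoid blood in the basal cisterns.",
    "Normal study.",
]
suggestions = [suggest_subtypes(r) for r in reports]
```

A report that matches no keyword list comes back empty, so a case coded as unspecified stroke would stay unspecified rather than being guessed at.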


OK, so far, I hope I have successfully convinced you that free-text data is important, and in some scenarios critical, for disease phenotyping.

What is free text data?

Free text is the string-based data that comes from allowing people to type answers into systems and forms. The resulting data is normally stored within one column, with one answer per cell. Because free text means the answer could be anything, this is what you get: absolutely anything.
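As a rough illustration of "one column, one answer per cell", the sketch below holds a few made-up form responses and pulls drug names and doses out of the free-text column with a regular expression. The rows, the column name `medication_notes`, the pattern and the helper `extract_doses` are all hypothetical, and the pattern only covers doses written in milligrams.

```python
import re

# Hypothetical form responses: one free-text answer per cell in one column.
rows = [
    {"patient_id": 1, "medication_notes": "Aspirin 75 mg once daily"},
    {"patient_id": 2, "medication_notes": "pt takes metformin 500mg twice a day"},
    {"patient_id": 3, "medication_notes": "none"},
]

# Illustrative pattern: a word, then a number, then the unit "mg".
DOSE_PATTERN = re.compile(
    r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*mg\b", re.IGNORECASE
)


def extract_doses(text):
    """Return (drug, dose_in_mg) pairs found in one free-text answer."""
    return [
        (match.group(1).lower(), float(match.group(2)))
        for match in DOSE_PATTERN.finditer(text)
    ]


# Map each patient to whatever the pattern could recover from their answer.
extracted = {
    row["patient_id"]: extract_doses(row["medication_notes"]) for row in rows
}
```

Because the answer really can be anything, a pattern like this will miss other units, misspellings and abbreviations, so it should be read as a starting point rather than a complete extractor.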

What is free EHR?

A free EHR is an electronic health records system that is offered at no cost.

What are the 3 components of the EHR system?

Elements of EHRs: compared to paper records, electronic health records contain more information about the patient and their care. Most EHRs contain the following information: the patient's demographic, billing, and insurance information.

What are the five functional components of an EHR?

Electronic Health Records: The Basics. Patient demographics, progress notes, vital signs, medical histories.