LING 575 F/G - Clinical Natural Language Processing



Most patient information that describes the patient state, diagnostic procedure, and disease progress is represented in free-text clinical notes. The information in notes appears either as narrative text or in a semi-structured format, such as lists or templates with free-text fields. These resources provide an opportunity for Natural Language Processing approaches to play a major role in biomedical research and clinical care by facilitating automated analysis of free-text information, which is otherwise accessible only through manual review. The goal of this seminar is to review current efforts in processing clinical text and to give students hands-on experience with publicly available clinical datasets.

Prerequisites: This is a hands-on course that will involve building clinical NLP systems based on publicly available datasets. Students should have taken LING 571 and 572.


The project will be 100% of your grade.



1. Extracting environmental factors from clinical notes:

Lifestyle and environmental factors play a significant role in both clinical research and clinical care. In clinical research, it has been established that 5-10% of cancers can be attributed to hereditary factors, while 90-95% have been found to correlate with lifestyle and environmental factors such as smoking, diet and exercise. In clinical care, it has long been standard practice to record social history, as this history affects not only diagnosis but also treatment options.

The dataset is a corpus created from the MTSamples website. The website provides a large collection of publicly available transcribed medical records. We created a detailed annotation guideline to annotate the following lifestyle and environmental factors: (1) substance abuse (smoking, alcohol and drug use), (2) occupation, (3) marital status, (4) family information, (5) residence, (6) living situation, (7) environmental exposures, (8) physical activity, (9) weight management, (10) sexual history, and (11) infectious disease history. We then defined 9 different dimensions that might apply to each type of factor. For example, for substance abuse (1), annotations are made regarding status (possible values: past, current, none, unknown), time frame (e.g. since 2010), method (e.g. drink, inhale, inject), type (e.g. cigarettes, wine, cocaine), amount (e.g. number of cigarettes or drinks), frequency (e.g. daily, socially, rarely), and history (e.g. after 10 years of smoking), while for occupation (2), the location and extent (e.g. part-time, night-shift) dimensions are annotated. More than 300 social history sections were annotated for system development.

Details of the dataset are available in: M. Yetisgen, E. Pellicer, D.R. Crosslin, L. Vanderwende. Automatic Identification of Lifestyle and Environmental Factors from Social History in Clinical Text. To appear in Proceedings of AMIA 2016 Joint Summits on Translational Science.
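To make the substance abuse dimensions listed above concrete, here is a minimal Python sketch of how one annotation could be held in memory. The class name SubstanceAnnotation, the field names, and the validation step are assumptions made for illustration only; they are not the format of the actual annotated corpus. Only the four status values and the dimension names come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Status values as listed in the project description.
STATUS_VALUES = {"past", "current", "none", "unknown"}


@dataclass
class SubstanceAnnotation:
    """Hypothetical container for one substance abuse annotation."""

    status: str                           # past, current, none, unknown
    time_frame: Optional[str] = None      # e.g. "since 2010"
    method: Optional[str] = None          # e.g. "drink", "inhale", "inject"
    substance_type: Optional[str] = None  # e.g. "cigarettes", "wine"
    amount: Optional[str] = None          # e.g. "1 pack"
    frequency: Optional[str] = None       # e.g. "daily", "socially"
    history: Optional[str] = None         # e.g. "after 10 years of smoking"

    def __post_init__(self) -> None:
        # Reject status labels outside the four listed values.
        if self.status not in STATUS_VALUES:
            raise ValueError(f"unexpected status: {self.status!r}")


# Example: a current daily cigarette smoker.
example = SubstanceAnnotation(
    status="current",
    substance_type="cigarettes",
    frequency="daily",
)
```

Dimensions that are not mentioned in a given social history section simply stay unset (None), which matches the idea that each dimension "might apply" rather than being required.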

2. TREC 2016 Challenge on Clinical Decision Support systems: (Important Note: This was released last week, and I have applied for data access!)

The following is the formal call for participation announcement for the challenge.

In making clinical decisions, physicians often seek out information about how to best care for their patients. Information relevant to a physician can be related to a variety of clinical tasks such as determining a patient's most likely diagnosis given a list of symptoms, deciding on the most effective treatment plan for a patient having a known condition, and determining if a particular test is indicated for a given situation. In some cases, physicians can find the information they seek in published biomedical literature. However, given the volume of the existing literature and the rapid pace at which new research is published, locating the most relevant and timely information for a particular clinical need can be a daunting and time-consuming task.

In order to make biomedical information more accessible and to meet the requirements for the meaningful use of electronic health records, a goal of clinical decision support systems is to anticipate the needs of physicians by linking medical records with information relevant for patient care. The goal of the clinical decision support track is to simulate the requirements of such systems and encourage the creation of tools and resources necessary for their implementation.

--- Task Description ---
Similar to the 2014 and 2015 tracks, the focus of the 2016 Clinical Decision Support Track will be the retrieval of biomedical articles relevant for answering generic clinical questions about medical records. We will be using EHR admission notes describing the patient's chief complaint and history of present illness. Participants of the track will be challenged with retrieving full-text biomedical articles that answer questions related to several types of clinical information needs. Each topic will consist of an EHR note and one of three generic clinical question types, such as "What is the patient's diagnosis?" Retrieved articles will be judged relevant if they provide information of the specified type that is pertinent to the given patient. The evaluation of submissions will follow standard TREC evaluation procedures.
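As a rough illustration of the retrieval setting described above, the following Python sketch ranks a small set of articles against an EHR note using TF-IDF weighted term overlap. This is only an assumed baseline for getting started, not the track's method or evaluation procedure. The function names tokenize and rank_articles, the smoothed IDF formula, and the toy articles are all hypothetical, and the sketch ignores the question type that accompanies each topic.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_articles(note, articles):
    """Return article ids ordered by TF-IDF weighted overlap with the note.

    `articles` maps an article id to its text. Ties are broken by id.
    """
    doc_counts = {aid: Counter(tokenize(text)) for aid, text in articles.items()}
    n_docs = len(articles)

    # Document frequency: number of articles containing each term.
    df = Counter()
    for counts in doc_counts.values():
        df.update(counts.keys())

    query_terms = set(tokenize(note))
    scores = {}
    for aid, counts in doc_counts.items():
        score = 0.0
        for term in query_terms:
            if term in counts:
                # Smoothed inverse document frequency (an assumed variant).
                idf = math.log((n_docs + 1) / (df[term] + 1)) + 1.0
                score += counts[term] * idf
        scores[aid] = score
    return sorted(scores, key=lambda aid: (-scores[aid], aid))


# Toy example with made-up article texts.
toy_articles = {
    "a1": "Chest pain and shortness of breath in myocardial infarction",
    "a2": "Knee replacement surgery outcomes",
}
toy_note = "Patient presents with chest pain and shortness of breath."
ranking = rank_articles(toy_note, toy_articles)
```

A real system would need to index the full-text article collection and take the question type into account, since relevance is judged against the specified type of information.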

3. Smoking information extraction:

2006 i2b2 NLP challenge dataset. The dataset is annotated for patients' smoking status (past smoker, current smoker, non-smoker, unknown).
Dataset paper:

4. Concept Extraction:

2010 i2b2 NLP challenge dataset. The dataset is annotated for concepts that describe medical problem, treatment, and test.
Dataset paper:

5. Assertion Extraction:

2010 i2b2 NLP challenge dataset. The dataset is annotated for assertion values for medical problems. Assertion values include: present, absent, conditional, hypothetical, possible, not-patient.
Dataset paper:

6. Co-reference Resolution:

2011 i2b2 NLP challenge dataset. The dataset is annotated for noun-phrase co-reference information.
Dataset paper:

7. Temporal Relation Extraction:

2012 i2b2 NLP challenge dataset. The dataset is annotated for temporal relations. Temporal information in clinical narratives plays an important role in patients' diagnosis, treatment and prognosis. In order to represent narrative information accurately, medical natural language processing (MLP) systems need to correctly identify and interpret temporal information. To promote research in this area, the Informatics for Integrating Biology and the Bedside (i2b2) project developed a temporally annotated corpus of clinical narratives. This corpus contains 310 de-identified discharge summaries, with annotations of clinical events, temporal expressions and temporal relations.
Dataset paper: