LING 575 F/G - Clinical Natural Language Processing
Logistics:
- Instructor: Meliha Yetisgen
- Time: Mondays, 3:30-5:50 p.m.
- Location: PAA, Room A-214
- Email List: ling575g_sp16@uw.edu
Description:
Most patient information that describes the patient's state, diagnostic procedures, and disease progress is represented in free-text clinical notes. The information in notes appears either as narrative or in a semi-structured format, through lists or templates with free-text fields. These resources provide an opportunity for Natural Language Processing approaches to play a major role in biomedical research and clinical care by facilitating automated analysis of free-text information, which is otherwise accessible only through manual review. The goal of this seminar is to review current efforts in processing clinical text and to give students hands-on experience with publicly available clinical datasets.
Prerequisites: This is a hands-on course that will involve building clinical NLP systems based on publicly available datasets. Students should have taken LING 571 and LING 572.
Project:
The project will be 100% of your grade.
- Part 1: Presentation: Survey or paper presentation of a selected data/text mining research area.
- Part 2: A system implementation: The project should involve at least one text processing problem discussed during the first two weeks of the class. Students will either pick one of the publicly available datasets or use a dataset of their own.
Schedule:
- Week #1 - 03/28/2016: Introductions
- Week #2 - 04/04/2016: Datasets
- Week #3 - 04/11/2016: Project topics
- Week #4 - 04/18/2016: Resources, project updates, time setup for literature review presentations
- Week #5 - 04/25/2016: Project updates
- Week #6 - 05/02/2016: Project updates
- Week #7 - 05/09/2016: Project updates
- Week #8 - 05/16/2016: Literature review presentations, project updates
  - Ethan, Maria: Obesity extraction
  - Adyasha: Social History
  - Micaela, Nick: Social History
- Week #9 - 05/23/2016: Literature review presentations, project updates
  - Elizabeth, Kenedy: Concept extraction
  - Kennedy, William, Spencer: Social History
  - Chris, Jason, Adam: TREC
  - Dae: Assertion
- Week #10 - Holiday
- Week #11 - Finals
AVAILABLE DATASETS & PROJECT IDEAS
1. Extracting environmental factors from clinical notes:
Lifestyle and environmental factors play a significant role in both clinical research and clinical care.
In clinical research, it has been established that 5-10% of cancers can be attributed to hereditary factors,
while 90-95% have been found to correlate with lifestyle and environmental factors such as smoking, diet, and exercise.
In clinical care, it has long been standard practice to record a patient's social history, as this history affects
not only diagnosis but also treatment options.
The dataset is a corpus created from the MTSamples website (http://www.mtsamples.com/). The website provides a large collection
of publicly available transcribed medical records. We created a detailed annotation guideline to annotate the following
lifestyle and environmental factors: (1) substance abuse (smoking, alcohol and drug use), (2) occupation, (3) marital status,
(4) family information, (5) residence, (6) living situation, (7) environmental exposures, (8) physical activity,
(9) weight management, (10) sexual history, and (11) infectious disease history. We then defined 9 different dimensions
that might apply to each type of factor. For example, for substance abuse (1), annotations are made regarding status (possible
values: past, current, none, unknown), time frame (e.g. since 2010), method (e.g. drink, inhale, inject), type
(e.g. cigarettes, wine, cocaine), amount (e.g. number of cigarettes or drinks), frequency (e.g. daily, socially, rarely),
and history (e.g. after 10 years of smoking), while for occupation (2), the location and extent (e.g. part-time, night-shift)
dimensions are annotated. More than 300 social history sections were annotated for system development.
Details of the dataset are available in: M. Yetisgen, E. Pellicer, D.R. Crosslin, L. Vanderwende. Automatic Identification of Lifestyle and
Environmental Factors from Social History in Clinical Text. To appear in Proceedings of AMIA 2016 Joint Summits on Translational Science.
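To make the annotation scheme above more concrete, here is a minimal Python sketch of how one annotated factor and its dimensions might be represented in a project. This is not the dataset's actual file format. The class name, the identifiers such as "time_frame", and the example sentence are all hypothetical; only the factor types, dimension names, and status values come from the description above.

```python
from dataclasses import dataclass, field
from typing import Dict

# Status values listed in the dataset description.
STATUS_VALUES = {"past", "current", "none", "unknown"}

# Dimensions described for two of the eleven factor types.
# The identifier spellings (e.g. "time_frame") are hypothetical.
DIMENSIONS = {
    "substance_abuse": {"status", "time_frame", "method", "type",
                        "amount", "frequency", "history"},
    "occupation": {"location", "extent"},
}


@dataclass
class FactorAnnotation:
    """One annotated lifestyle/environmental factor (hypothetical container)."""
    factor: str                       # e.g. "substance_abuse"
    text: str                         # annotated text span
    dimensions: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if the annotation does not fit the scheme above."""
        allowed = DIMENSIONS.get(self.factor)
        if allowed is None:
            raise ValueError(f"unknown factor type: {self.factor}")
        extra = set(self.dimensions) - allowed
        if extra:
            raise ValueError(
                f"dimensions not defined for {self.factor}: {sorted(extra)}")
        status = self.dimensions.get("status")
        if status is not None and status not in STATUS_VALUES:
            raise ValueError(f"invalid status: {status}")


# Made-up example sentence, not taken from the corpus.
example = FactorAnnotation(
    factor="substance_abuse",
    text="smokes one pack of cigarettes daily since 2010",
    dimensions={"status": "current", "type": "cigarettes",
                "amount": "one pack", "frequency": "daily",
                "time_frame": "since 2010"},
)
example.validate()  # passes silently when the annotation is consistent
```

A check like validate() can catch annotation or parsing mistakes early, before they reach a classifier.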
2. TREC 2016 Challenge on Clinical Decision Support systems: (Important Note: This was released last week and I have applied for data access!)
This is the formal call for participation announcement for the challenge.
In making clinical decisions, physicians often seek out information
about how to best care for their patients. Information relevant to a physician can be related to a variety of clinical tasks such as
determining a patient's most likely diagnosis given a list of symptoms, deciding on the most effective treatment plan for a patient
having a known condition, and determining if a particular test is indicated for a given situation. In some cases, physicians can find
the information they seek in published biomedical literature. However, given the volume of the existing literature and the rapid pace
at which new research is published, locating the most relevant and timely information for a particular clinical need can be a daunting
and time-consuming task.
In order to make biomedical information more accessible and to meet the requirements for the meaningful use of electronic health records,
a goal of clinical decision support systems is to anticipate the needs of physicians by linking medical records with information relevant
for patient care. The goal of the clinical decision support track is to simulate the requirements of such systems and encourage the
creation of tools and resources necessary for their implementation.
--- Task Description ---
- Topics/Queries: EHR admission notes
- Data Collection: Open Access snapshot of PubMed Central (PMC) *updated for 2016*
- Website: http://www.trec-cds.org
- Mailing List: http://groups.google.com/d/forum/trec-cds
Similar to the 2014 and 2015 tracks, the focus of the 2016 Clinical Decision Support Track will be the retrieval of biomedical
articles relevant for answering generic clinical questions about medical records.
We will be using EHR admission notes describing the patient’s chief complaint and history of present illness.
Participants of the track will be challenged with retrieving full-text biomedical articles that answer questions related
to several types of clinical information needs. Each topic will consist of an EHR note and one of three generic clinical
question types, such as "What is the patient's diagnosis?" Retrieved articles will be judged relevant if they provide
information of the specified type that is pertinent to the given patient. The evaluation of submissions will follow
standard TREC evaluation procedures.
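As a starting point for a project on this task, the sketch below shows one possible baseline: represent each topic as an EHR note plus a question type, and rank articles by bag-of-words cosine similarity to the note. This is an illustrative assumption, not the track's method or its official data format. The Topic class, the "diagnosis" label, the article IDs, and the toy texts are all hypothetical, and this baseline ignores the question type entirely.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Topic:
    """A topic: an EHR admission note plus a generic question type."""
    note: str
    question_type: str  # hypothetical label, e.g. "diagnosis"


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(topic: Topic, articles: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (article_id, score) pairs, highest score first."""
    query = Counter(tokenize(topic.note))
    scored = [(doc_id, cosine(query, Counter(tokenize(text))))
              for doc_id, text in articles.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data for illustration only; not from the PMC collection.
topic = Topic(note="Patient presents with chest pain and shortness of breath.",
              question_type="diagnosis")
articles = {
    "article-A": "Chest pain and shortness of breath in acute coronary syndrome.",
    "article-B": "Dietary fiber and gut microbiome composition.",
}
ranking = rank(topic, articles)
```

A real system would likely add term weighting (e.g. TF-IDF or BM25), use the question type, and extract key concepts from the note rather than using every word as a query term.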
3. Smoking information extraction:
2006 i2b2 NLP challenge dataset. The dataset is annotated for patients' smoking status (past smoker, current smoker, non-smoker, unknown).
Dataset paper:
- Uzuner O, Goldstein I, Luo Y, Kohane I. Identifying patient smoking
status from medical discharge records. J Am Med Inform Assoc. 2008
Jan-Feb;15(1):14-24. PubmedLink
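For this dataset, a simple rule-based baseline can be a useful first step before trying statistical classifiers. The sketch below assigns one of the four labels listed above using a handful of keyword patterns. The patterns and example sentences are made up for illustration and are not derived from the i2b2 data or from the challenge systems.

```python
import re

# Ordered rules: earlier labels take priority, so negated and past mentions
# are checked before the generic "smoking" pattern. Patterns are illustrative.
RULES = [
    ("non-smoker", re.compile(
        r"\b(never smoked|non-?smoker|denies (smoking|tobacco)|no tobacco)\b",
        re.IGNORECASE)),
    ("past smoker", re.compile(
        r"\b(quit smoking|stopped smoking|former smoker|ex-?smoker)\b",
        re.IGNORECASE)),
    ("current smoker", re.compile(
        r"\b(current smoker|smokes|smoking)\b",
        re.IGNORECASE)),
]


def classify_smoking_status(text: str) -> str:
    """Return the first matching label, or "unknown" if no rule fires."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return "unknown"


# Made-up sentences, not taken from the dataset.
examples = [
    "She quit smoking in 1999.",
    "Patient smokes one pack per day.",
    "He is a nonsmoker.",
    "Admitted for knee replacement.",
]
labels = [classify_smoking_status(s) for s in examples]
```

Rules like these are brittle (for example, "smoking cessation counseling provided" would be labeled as a current smoker), which is one reason to compare them against a trained classifier.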
4. Concept Extraction:
2010 i2b2 NLP challenge dataset. The dataset is annotated for concepts that describe medical problem, treatment, and test.
Dataset paper:
- Uzuner Ö., South B., Shen S., DuVall S. (2011). "2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text".
Journal of the American Medical Informatics Association. 2011;18:552-556 PubmedLink
5. Assertion Extraction:
2010 i2b2 NLP challenge dataset. The dataset is annotated for assertion values for medical problems. Assertion values include: present, absent, conditional, hypothetical, possible, not-patient.
Dataset paper:
- Uzuner Ö., South B., Shen S., DuVall S. (2011). "2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text".
Journal of the American Medical Informatics Association. 2011;18:552-556 PubmedLink
6. Co-reference Resolution:
2011 i2b2 NLP challenge dataset. The dataset is annotated for noun-phrase co-reference information.
Dataset paper:
- Uzuner Ö., Bodnari A, Shen S, Forbush T, Pestian J, South BR. (2012). "Evaluating the state of the art in coreference resolution for electronic medical records". J Am Med Inform Assoc. 2012 Sep-Oct;19(5):786-91.
PubmedLink
7. Temporal Relation Extraction:
2012 i2b2 NLP challenge dataset. The dataset is annotated for temporal relations. Temporal
information in clinical narratives plays an important role in patients' diagnosis, treatment
and prognosis. In order to represent narrative information accurately, medical natural language
processing (MLP) systems need to correctly identify and interpret temporal information.
To promote research in this area, the Informatics for Integrating Biology and the Bedside
(i2b2) project developed a temporally annotated corpus of clinical narratives. This corpus
contains 310 de-identified discharge summaries, with annotations of clinical events, temporal
expressions and temporal relations.
Dataset paper:
- Sun W, Rumshisky A, Uzuner Ö. (2013). "Annotating temporal information in clinical narratives". J Biomed Inform. 2013 Dec;46 Suppl:S5-12.
PubmedLink
- Sun W, Rumshisky A, Uzuner Ö. (2013). "Evaluating temporal relations in clinical text: 2012 i2b2 Challenge". J Am Med Inform Assoc. 2013 Sep-Oct;20(5):806-13.
PubmedLink