Introduction to Correlated Data
Examples of correlated data are common: in household surveys,
responses from two or more members of the same household are
obtained; in toxicology, responses to a toxic agent are observed
on individual animals from the same litter; in longitudinal
studies, outcomes at several different times are observed on the same
patient; in ecological studies, disease risks are compared across time
or space. When outcomes are grouped by household, litter, or patient,
we call data sets like these "clustered data". The
key feature of clustered data is that the outcomes on
related individuals or things from the same cluster are likely to be
correlated (not statistically independent). Other terms used to describe such
data include "correlated data" and "dependent data".
Proper analysis of clustered data requires taking the clustering into
account; it is NOT OKAY to ignore the clustering.
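As a rough illustration of why ignoring the clustering matters, the following Python sketch simulates clustered outcomes and compares the standard error of the overall mean computed three ways: empirically across simulated data sets, by the naive formula that treats all observations as independent, and by the naive formula inflated by the design effect 1 + (m - 1) * rho, where m is the cluster size and rho is the within-cluster correlation. The random-intercept model, the function names, and all parameter values are assumptions chosen for illustration; they are not taken from the course notes.

```python
import random
import statistics


def simulate_mean(n_clusters, cluster_size, sd_between, sd_within, rng):
    """Simulate one clustered data set and return its overall mean.

    Assumed model (illustrative only): each outcome is a cluster-level
    random effect plus independent individual-level noise.
    """
    values = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0.0, sd_between)  # shared within a cluster
        for _ in range(cluster_size):
            values.append(cluster_effect + rng.gauss(0.0, sd_within))
    return statistics.fmean(values)


def compare_standard_errors(n_clusters=30, cluster_size=10,
                            sd_between=1.0, sd_within=1.0,
                            n_reps=2000, seed=1):
    """Compare empirical, naive, and design-effect-adjusted standard errors."""
    rng = random.Random(seed)
    means = [
        simulate_mean(n_clusters, cluster_size, sd_between, sd_within, rng)
        for _ in range(n_reps)
    ]
    # Spread of the sample mean across repeated simulated studies.
    empirical_se = statistics.stdev(means)

    total_var = sd_between ** 2 + sd_within ** 2
    n_total = n_clusters * cluster_size
    # Within-cluster (intraclass) correlation under the assumed model.
    icc = sd_between ** 2 / total_var
    # Naive formula: pretends all n_total observations are independent.
    naive_se = (total_var / n_total) ** 0.5
    # Design effect for equal-sized clusters.
    design_effect = 1 + (cluster_size - 1) * icc
    adjusted_se = naive_se * design_effect ** 0.5
    return {
        "empirical_se": empirical_se,
        "naive_se": naive_se,
        "adjusted_se": adjusted_se,
        "design_effect": design_effect,
    }


if __name__ == "__main__":
    for name, value in compare_standard_errors().items():
        print(f"{name}: {value:.4f}")
```

With these assumed defaults (30 clusters of size 10 and a within-cluster correlation of 0.5), the design effect is 5.5, so the naive formula understates the standard error of the mean by a factor of about 2.3, and the empirical standard error should land close to the adjusted value.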
Intended Audience
The intended audience for this course is graduate students who have
had an introduction to biostatistics, and who [a] understand basic
probability concepts, such as random variables, expectation, variance
and correlation; [b] understand basic statistical concepts such as
the distinction between populations and samples from a population,
parameter estimation, standard errors, hypothesis tests and confidence
intervals; and [c] are able to carry out statistical analyses such as
linear regression, analysis of covariance and logistic regression, and
explain them to an epidemiological audience. Familiarity with standard
epidemiologic study designs and their analysis is beneficial, as is
previous experience with the statistical software package STATA.
Prerequisites
BIOST/EPI 536 and BIOSTAT 518; or
permission of instructor.
Rationale for this Course
The rationale for the study
of correlated data, and for this course, is described briefly below:
1. Correlated data are ubiquitous across all of the health sciences disciplines
(not to mention many other disciplines outside of the health sciences).
2. Ignoring the clustering is (most often) disastrous. To obtain correct
statistical inference, it is essential to acknowledge the clustering,
both in study design and in statistical analysis.
3. Although correlated data are common, principles and methods for handling
them receive little attention in most courses in biostatistics, statistics,
or epidemiology. Consequently, without a course such as this one, there
would be a gap in the training of students in these fields, virtually all
of whom will encounter clustered data many times in their professional
careers.
Purpose and Learning Objectives
The aims of this course are:
1. to introduce the concepts of correlated data, to describe the basic
structures of correlated data, and to explain how correlation arises
in common study designs;
2. to contrast the behavior of correlated data with uncorrelated data
and to show how the behavior of correlated data influences design
and statistical analysis;
3. to show how to analyze correlated data arising from several common
correlated data structures using statistical computing packages such
as STATA and SAS; and
4. to introduce more advanced topics in the analysis of correlated data.
As suggested by these aims,
the course seeks to develop an understanding of correlated
data, including how it arises, its implications for statistical
inference, and how to accommodate it in statistical analysis. At
the end of the course, the student should
1. be able to recognize correlated data and explain how it arises;
2. understand the impact of correlated data on design and statistical
analysis;
3. know the basic structures of correlated data;
4. be able to formulate models for real-life correlated data and correctly
interpret the parameters of the model;
5. be able to choose appropriate analysis methods for correlated data and
explain them to a nonstatistical audience;
6. know how to perform several methods of analysis of correlated data using
statistical packages and be able to recognize situations that cannot be
addressed by these techniques and that require expert assistance; and
7. be familiar with some of the key references on correlated data and be
prepared for the study of more advanced correlated data methods.
Course Web Page and Email
http://faculty.washington.edu/yanez/b540/.
This page will be used for posting announcements, datasets and extra
material as needed. Course notes will primarily be handed out in class.
Lectures, Homework, Assignments, Exams, and Grading
Class time will consist of
lectures and class discussion of assigned readings and discussion
assignments. Discussion assignments will be given approximately weekly and will
include assigned readings, critiques of journal articles, data analyses,
and detailed presentation and interpretation of results. Some assignments
will require the use of the statistical package STATA (available in the
Health Sciences Library computing lab), and other assignments can be done
using a package of the student's choice (SAS, R and Splus may be useful
in addition to STATA). There will be two exams: an in-class midterm
exam on MAY 3, and a take-home final exam that will be distributed on
JUNE 4. The examinations will include questions that
test knowledge of definitions and concepts covered in the notes,
understanding of these concepts, knowledge of appropriate methods of
analysis in a given situation, and ability to correctly interpret
results of a data analysis based on computer output. The final course
grades will be based on the following components: Midterm Examination
(50%), Final Examination (50%).
The final exam will be due Wednesday, JUNE 9 (the day of the
scheduled course final exam) at 12:00 p.m.
Course Notes
Course notes will be
handed out on a weekly basis throughout the course. The notes
will form the basis for the required readings for the course,
and will be augmented by assigned readings from statistical
and epidemiological journals. There is no required textbook
for this course. However, the following books are recommended
as useful references for the future or as supplementary material
for this course.
Diggle, P.J., Heagerty, P.J., Liang, K.Y., and Zeger, S.L. (2002).
Analysis of Longitudinal Data (2nd ed.). Oxford: Oxford University
Press. (This is an excellent text that gives some mathematical theory
as well as practical aspects and applications of methods for the analysis
of longitudinal data. If you have the first edition, that will do quite
well, though there are two excellent new chapters in the second edition
on advanced material.)
Fitzmaurice, G.M., Laird, N.M., and Ware, J.H. (2004). Applied Longitudinal
Analysis. Wiley. (This text provides an introductory presentation
of longitudinal data methods suitable for graduate-level work.)
Important Dates
March 29: First Lecture.
May 3: Midterm Exam.
June 4: Final Exam distributed.
June 9: Final Exam due (12:00 p.m.).
