BIOST 540 Spring 2010
Correlated Data Regression


Introduction to Correlated Data

Examples of correlated data are common: in household surveys, responses are obtained from two or more members of the same household; in toxicology, responses to a toxic agent are observed on individual animals from the same litter; in longitudinal studies, outcomes are observed at several different times on the same patient; in ecological studies, disease risks are compared across time or space. When outcomes are grouped in this way (by household, litter, or patient), the data are called "clustered data"; the terms "correlated data" and "dependent data" are also used. The key feature of clustered data is that outcomes on related individuals or units from the same cluster are likely to be correlated (not statistically independent). Proper analysis of clustered data requires taking the clustering into account -- it is NOT OKAY to ignore the clustering.

Intended Audience

The intended audience for this course is graduate students who have had an introduction to biostatistics, and who [a] understand basic probability concepts, such as random variables, expectation, variance and correlation; [b] understand basic statistical concepts, such as the distinction between populations and samples from a population, parameter estimation, standard errors, hypothesis tests and confidence intervals; and [c] are able to carry out statistical analyses such as linear regression, analysis of covariance and logistic regression, and explain them to an epidemiological audience. Familiarity with standard epidemiologic study designs and their analysis is beneficial, as is previous experience with the statistical software package STATA.


Prerequisites: BIOST/EPI 536 and BIOSTAT 518, or permission of instructor.

Rationale for this Course

The rationale for the study of correlated data, and for this course, is described briefly below:

1.    Correlated data are ubiquitous across all of the health sciences disciplines (not to mention many disciplines outside of the health sciences).

2.    Ignoring the clustering is (most often) disastrous. To obtain correct statistical inference, it is essential to acknowledge the clustering, both in study design and in statistical analysis.

3.    Although correlated data are common, principles and methods for handling them receive little attention in most courses in biostatistics, statistics, or epidemiology. Consequently, without a course such as this one, there would be a gap in the training of students in these fields, virtually all of whom will encounter clustered data many times in their professional careers.
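
The second point above can be illustrated with a small simulation. The sketch below is not taken from the course materials: it assumes a simple random-intercept model with total variance 1, equal-sized clusters, and an intraclass correlation of 0.3, and the function names and parameter values are hypothetical choices made for illustration. It compares the actual sampling variance of an overall mean with the variance an analysis would report if the observations were wrongly treated as independent. Under these assumptions the ratio of the two should be close to the design effect, 1 + (m - 1) * rho, where m is the cluster size and rho is the intraclass correlation.

```python
import random
import statistics


def design_effect(cluster_size, icc):
    """Variance inflation factor for a mean from equal-sized clusters."""
    return 1 + (cluster_size - 1) * icc


def simulate_mean(rng, n_clusters, cluster_size, icc):
    """Draw one clustered sample (total variance 1) and return its overall mean."""
    total = 0.0
    for _ in range(n_clusters):
        # Effect shared by every member of the cluster (variance = icc).
        shared = rng.gauss(0, icc ** 0.5)
        for _ in range(cluster_size):
            # Individual-level noise (variance = 1 - icc).
            total += shared + rng.gauss(0, (1 - icc) ** 0.5)
    return total / (n_clusters * cluster_size)


def empirical_variance_ratio(n_clusters=20, cluster_size=10, icc=0.3,
                             reps=2000, seed=540):
    """Ratio of the simulated variance of the mean to the naive variance."""
    rng = random.Random(seed)
    means = [simulate_mean(rng, n_clusters, cluster_size, icc)
             for _ in range(reps)]
    # Variance of the mean if all observations were independent.
    naive_var = 1.0 / (n_clusters * cluster_size)
    return statistics.variance(means) / naive_var


if __name__ == "__main__":
    print("Theoretical design effect:", design_effect(10, 0.3))
    print("Simulated variance ratio: ", empirical_variance_ratio())
```

With 10 observations per cluster and an intraclass correlation of 0.3, the naive variance is too small by a factor of roughly 3.7, so standard errors that ignore the clustering would be too small by a factor of nearly 2.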

Purpose and Learning Objectives

The aims of this course are:

1.    to introduce the concepts of correlated data, to describe the basic structures of correlated data, and to explain how correlation arises in common study designs;

2.    to contrast the behavior of correlated data with uncorrelated data and to show how the behavior of correlated data influences design and statistical analysis;

3.    to show how to analyze correlated data arising from several common correlated data structures using statistical computing packages such as STATA and SAS; and

4.    to introduce more advanced topics in the analysis of correlated data.

As suggested by these aims, the course seeks to develop an understanding of correlated data, including how it arises, its implications for statistical inference, and how to accommodate it in statistical analysis. At the end of the course, the student should

1.    be able to recognize correlated data and explain how it arises;

2.    understand the impact of correlated data on design and statistical analysis;

3.    know the basic structures of correlated data;

4.    be able to formulate models for real-life correlated data and correctly interpret the parameters of the model;

5.    be able to choose appropriate analysis methods for correlated data and explain them to a non-statistical audience;

6.    know how to perform several methods of analysis of correlated data using statistical packages and be able to recognize situations that cannot be addressed by these techniques and that require expert assistance; and

7.    be familiar with some of the key references on correlated data and be prepared for the study of more advanced correlated data methods.

Course Web Page and Email


The course web page will be used for posting announcements, datasets and extra material as needed. Course notes will primarily be handed out in class.

Lectures, Homework, Assignments, Exams, and Grading

Class time will consist of lectures and class discussion of assigned readings and discussion assignments. Discussion assignments will be given approximately weekly and will include assigned readings, critiques of journal articles, data analyses, and detailed presentation and interpretation of results. Some assignments will require the use of the statistical package STATA (available in the Health Sciences Library computing lab), and other assignments can be done using a package of the student's choice (SAS, R and Splus may be useful in addition to STATA). There will be two exams: an in-class midterm exam on MAY 3, and a take-home final exam that will be distributed on JUNE 4. The examinations will include questions that test knowledge of definitions and concepts covered in the notes, understanding of these concepts, knowledge of appropriate methods of analysis in a given situation, and ability to correctly interpret results of a data analysis based on computer output. The final course grades will be based on the following components: Midterm Examination (50%), Final Examination (50%).

The final exam will be due Wednesday, JUNE 9 -- the day of the scheduled course final exam -- at 12:00 p.m.

Course Notes

Course notes will be handed out on a weekly basis throughout the course. The notes will form the basis for the required readings for the course, and will be augmented by assigned readings from statistical and epidemiological journals. There is no required textbook for this course. However, the following books are recommended as useful references for future use or as supplementary material for this course.

Diggle, P.J., Heagerty, P.J., Liang, K.-Y., and Zeger, S.L. (2002). Analysis of Longitudinal Data (2nd ed.). Oxford: Oxford University Press. (This is an excellent text that gives some mathematical theory as well as practical aspects and applications of methods for the analysis of longitudinal data. If you have the first edition, that will do quite well, though there are two excellent new chapters in the second edition on advanced material.)

Fitzmaurice, G.M., Laird, N.M., and Ware, J.H. (2004). Applied Longitudinal Analysis. Hoboken, NJ: Wiley. (This text provides an introductory presentation of longitudinal data methods suitable for graduate-level work.)

Important Dates

March 29: First Lecture.

May 3: Midterm Exam.

June 4: Final Exam distributed.

June 9: Final Exam due (12:00 p.m.).

