STAT/BIOSTAT 550, 2014: Homework 3: Due April 18


1. The data for this question come from a study by Dr. Arno Motulsky and coworkers, and are published in Thompson et al. (1988; Am.J.Hum.Genet, 42, 113-124).
There are two loci, here labeled 1 and 2, each with two alleles labeled A and B. In a Caucasian sample of 205 individuals typed, the counts of the two-locus phenotypes were
143, 35, 3, 17, 5, 0, 2, 0, and 0, respectively, for the nine types
A1A1,A2A2; A1A1,A2B2; A1A1,B2B2; A1B1,A2A2; A1B1,A2B2; A1B1,B2B2; B1B1,A2A2; B1B1,A2B2; B1B1,B2B2.
Use the EM algorithm to estimate the 4 haplotype frequencies.
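
A minimal sketch of one way to set up the EM iteration for question 1, assuming haplotypes pair at random so that only the double heterozygote A1B1,A2B2 has ambiguous phase. The function name em_haplotypes, the 3x3 count layout, the starting values, and the convergence tolerance are illustrative choices, not part of the assignment.

```python
def em_haplotypes(counts, start=(0.25, 0.25, 0.25, 0.25), tol=1e-10, max_iter=1000):
    """EM estimate of (qAA, qAB, qBA, qBB) from two-locus genotype counts.

    counts[i][j] = number of individuals carrying i copies of B at locus 1
    and j copies of B at locus 2 (i, j in 0..2).
    """
    n = counts
    total = 2 * sum(sum(row) for row in n)  # number of haplotypes in the sample
    q = tuple(start)
    for _ in range(max_iter):
        q_aa, q_ab, q_ba, q_bb = q
        # E-step: probability that a double heterozygote is A1A2 / B1B2
        cis = q_aa * q_bb
        trans = q_ab * q_ba
        # Assumption: split evenly if both phase configurations have probability 0.
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        d_cis = w * n[1][1]
        d_trans = (1 - w) * n[1][1]
        # M-step: expected haplotype counts divided by the total
        new_q = (
            (2 * n[0][0] + n[0][1] + n[1][0] + d_cis) / total,
            (2 * n[0][2] + n[0][1] + n[1][2] + d_trans) / total,
            (2 * n[2][0] + n[1][0] + n[2][1] + d_trans) / total,
            (2 * n[2][2] + n[1][2] + n[2][1] + d_cis) / total,
        )
        change = max(abs(a - b) for a, b in zip(new_q, q))
        q = new_q
        if change < tol:
            break
    return q


# Counts in the order given in the question: rows are A1A1, A1B1, B1B1;
# columns are A2A2, A2B2, B2B2.
counts = [
    [143, 35, 3],
    [17, 5, 0],
    [2, 0, 0],
]
q_hat = em_haplotypes(counts)
print(dict(zip(("qAA", "qAB", "qBA", "qBB"), q_hat)))
```

Only the 5 double heterozygotes are split between the two phase configurations; every other phenotype contributes fixed haplotype counts, so the single-locus allele frequencies do not change across iterations.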

2. This very small example is unrealistic, but makes a point we will see in #3. We have a sample of just one individual who is heterozygous at both of two loci: A1B1, A2B2. Denote the frequencies of the four haplotypes A1A2, A1B2, B1A2, and B1B2 by qAA, qAB, qBA, and qBB.
Show that there are two maximum likelihood estimates (of equal likelihood):
qAA = qBB = 1/2, qAB = qBA = 0 and qAB = qBA = 1/2, qAA = qBB = 0.
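
As a starting point, and assuming the individual's two haplotypes are drawn independently from the population (an assumption the question does not state explicitly), the likelihood of the single double heterozygote can be written as below. The subscripted symbols correspond to qAA, qAB, qBA, and qBB above.

```latex
L(q) = 2\,q_{AA}\,q_{BB} + 2\,q_{AB}\,q_{BA},
\qquad
q_{AA} + q_{AB} + q_{BA} + q_{BB} = 1,
\quad q_{\cdot\cdot} \ge 0 .
```

The two terms correspond to the two possible phase configurations, A1A2 with B1B2 and A1B2 with B1A2.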

3. Suppose now we try to apply the EM algorithm to the data of question 2. What will the EM algorithm do? The answer depends on where you start, so work through several possible starting values. (Questions 2 and 3 together show a problem that can arise when using the EM algorithm to estimate haplotype frequencies.)
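
One way to explore starting values numerically is sketched below, again assuming random pairing of haplotypes. The helper names em_step and run_em and the particular starting points are hypothetical; other starts are worth trying.

```python
def em_step(q):
    """One EM update for a sample of one double heterozygote.

    q = (qAA, qAB, qBA, qBB).
    """
    q_aa, q_ab, q_ba, q_bb = q
    cis = q_aa * q_bb      # weight for phase A1A2 / B1B2
    trans = q_ab * q_ba    # weight for phase A1B2 / B1A2
    if cis + trans == 0:
        raise ValueError("starting values give the observed genotype probability 0")
    w = cis / (cis + trans)  # E-step: posterior probability of the first phase
    # M-step: two haplotypes in total, each phase contributes one of each of its pair
    return (w / 2, (1 - w) / 2, (1 - w) / 2, w / 2)


def run_em(q, n_iter=50):
    """Apply em_step n_iter times and return the final frequencies."""
    for _ in range(n_iter):
        q = em_step(q)
    return q


for start in [(0.25, 0.25, 0.25, 0.25), (0.4, 0.1, 0.1, 0.4), (0.1, 0.4, 0.4, 0.1)]:
    print(start, "->", run_em(start))
```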

4. Here are the SNP genotypes of 6 individuals at 5 loci, with the alleles labelled 0 and 1. Each pair of digits is the genotype at a SNP locus, and each row is an individual.
(Note these are the exact same data you will run PHASE or fastPHASE on in Lab-1.)
10, 10, 10, 10, 10
00, 00, 00, 11, 10
00, 00, 00, 10, 00
10, 10, 10, 11, 00
11, 11, 11, 11, 00
00, 00, 00, 10, 11
Using Clark's algorithm (lecture notes 1.5.3) find an estimate of the haplotypes of each individual. Does the algorithm give a unique solution?
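
For checking a hand calculation, here is a rough sketch of a Clark-style procedure. It assumes the usual description of the algorithm (resolve individuals with at most one heterozygous locus first, then resolve each remaining individual with the first known haplotype that is compatible with its genotype); the version in lecture notes 1.5.3 may differ in detail. The names parse_genotype and clark are illustrative.

```python
def parse_genotype(row):
    """Turn '10, 11, 00' into per-locus counts of allele 1, e.g. (1, 2, 0)."""
    return tuple(pair.strip().count("1") for pair in row.split(","))


def clark(genotypes):
    """Return (resolved, known): haplotype pairs per individual and the haplotype list."""
    known = []     # haplotypes found so far, in order of discovery
    resolved = {}  # individual index -> (haplotype, haplotype)

    def record(i, h1, h2):
        resolved[i] = (h1, h2)
        for h in (h1, h2):
            if h not in known:
                known.append(h)

    # Step 1: individuals with at most one heterozygous locus are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(c == 1 for c in g) <= 1:
            h1 = tuple(c // 2 for c in g)             # 0 at a heterozygous locus
            h2 = tuple(c - a for c, a in zip(g, h1))  # the complementary haplotype
            record(i, h1, h2)

    # Step 2: resolve the rest using known haplotypes, in the order they were found.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                rest = tuple(c - a for c, a in zip(g, h))
                if all(r in (0, 1) for r in rest):  # h is compatible with g
                    record(i, h, rest)
                    progress = True
                    break
    return resolved, known


rows = [
    "10, 10, 10, 10, 10",
    "00, 00, 00, 11, 10",
    "00, 00, 00, 10, 00",
    "10, 10, 10, 11, 00",
    "11, 11, 11, 11, 00",
    "00, 00, 00, 10, 11",
]
genotypes = [parse_genotype(r) for r in rows]
resolved, known = clark(genotypes)
for i in sorted(resolved):
    print(i + 1, resolved[i])
```

The result of step 2 depends on the order in which known haplotypes are tried, so reordering the list `known` (or the individuals) is one way to probe the uniqueness part of the question.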