STAT/BIOST 550; Spring 2014

STAT/BIOSTAT 550, 2014: Homework 3: Due April 18

1. The data for this question come from a study by Dr. Arno Motulsky and coworkers, and are published in Thompson et al. (1988; Am. J. Hum. Genet. 42, 113-124).
There are two loci, here labeled 1 and 2, each with two alleles labeled A and B. In a Caucasian sample of 205 typed individuals, the counts of the two-locus phenotypes were
143, 35, 3, 17, 5, 0, 2, 0, and 0, respectively, for the nine types
A₁A₁,A₂A₂; A₁A₁,A₂B₂; A₁A₁,B₂B₂; A₁B₁,A₂A₂; A₁B₁,A₂B₂; A₁B₁,B₂B₂; B₁B₁,A₂A₂; B₁B₁,A₂B₂; B₁B₁,B₂B₂.
Use the EM algorithm to estimate the 4 haplotype frequencies.
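
One way to set up the computation is sketched below in Python. It is an illustration, not a prescribed solution: it assumes random mating (so that haplotype pairs combine independently), and the function name, the 3x3 table layout, the uniform starting value, and the stopping rule are all choices made here rather than anything specified in the question. Only the double heterozygote A₁B₁,A₂B₂ has uncertain phase, so the E-step reduces to splitting those individuals between the pairings A₁A₂/B₁B₂ and A₁B₂/B₁A₂.

```python
def em_two_locus(counts, start=(0.25, 0.25, 0.25, 0.25), tol=1e-12, max_iter=10000):
    """EM estimates of (q_AA, q_AB, q_BA, q_BB) from a 3x3 genotype table.

    counts[i][j] = number of individuals with i copies of B at locus 1
    and j copies of B at locus 2 (assumed layout, matching the order of
    the nine types listed in the question).
    """
    n_haps = 2 * sum(sum(row) for row in counts)
    # Haplotype copies contributed by the eight phase-known genotypes.
    fixed = (
        2 * counts[0][0] + counts[0][1] + counts[1][0],  # A1A2
        2 * counts[0][2] + counts[0][1] + counts[1][2],  # A1B2
        2 * counts[2][0] + counts[1][0] + counts[2][1],  # B1A2
        2 * counts[2][2] + counts[1][2] + counts[2][1],  # B1B2
    )
    n_dh = counts[1][1]  # double heterozygotes, the only phase-unknown class
    q = tuple(start)
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is A1A2/B1B2.
        cis = q[0] * q[3]
        trans = q[1] * q[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5  # arbitrary split if undefined
        # M-step: expected haplotype counts divided by the number of haplotypes.
        new_q = (
            (fixed[0] + n_dh * w) / n_haps,
            (fixed[1] + n_dh * (1 - w)) / n_haps,
            (fixed[2] + n_dh * (1 - w)) / n_haps,
            (fixed[3] + n_dh * w) / n_haps,
        )
        converged = max(abs(a - b) for a, b in zip(new_q, q)) < tol
        q = new_q
        if converged:
            break
    return q


# Rows: A1A1, A1B1, B1B1; columns: A2A2, A2B2, B2B2.
COUNTS = [[143, 35, 3], [17, 5, 0], [2, 0, 0]]
q_hat = em_two_locus(COUNTS)
print(q_hat)
```

Running it from a few different values of `start` is a simple check on whether the answer depends on the starting point for these data.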

2. This very small example is unrealistic, but it makes a point that we will see in question 3. We have a sample of just one individual, who is heterozygous at both of two loci: A₁B₁, A₂B₂. Denote the frequencies of the four haplotypes A₁A₂, A₁B₂, B₁A₂, and B₁B₂ by q_AA, q_AB, q_BA, and q_BB.
Show that there are two maximum likelihood estimates (of equal likelihood):
q_AA = q_BB = 1/2, q_AB = q_BA = 0 and q_AB = q_BA = 1/2, q_AA = q_BB = 0.
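
As a starting point, the likelihood can be written out explicitly. The block below assumes random mating, so that the individual's two haplotypes are independent draws from the haplotype frequencies; the symbol s is introduced here only for the bound.

```latex
\begin{align*}
L(q) &= P(A_1B_1,\,A_2B_2)
      = 2\,q_{AA}\,q_{BB} + 2\,q_{AB}\,q_{BA},
      \qquad q_{AA}+q_{AB}+q_{BA}+q_{BB}=1, \\
\text{with } s &= q_{AA}+q_{BB}: \quad
  q_{AA}\,q_{BB} \le \tfrac{s^2}{4}, \quad
  q_{AB}\,q_{BA} \le \tfrac{(1-s)^2}{4}, \\
L(q) &\le \tfrac{1}{2}\left(s^2 + (1-s)^2\right) \le \tfrac{1}{2}.
\end{align*}
```

The first inequality is tight only when each pair of frequencies is split equally, and the second only when s is 0 or 1.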

3. Suppose now we try to apply the EM algorithm to the data of question 2. What will the EM algorithm do? The answer depends on where you start, so work through several possible starting values. (These two questions show a problem that can arise in using the EM algorithm to estimate haplotype frequencies.)
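
A small Python sketch for experimenting with starting values follows. It assumes random mating and a sample consisting only of the single double heterozygote from question 2; the function name and the fixed iteration count are choices made here for illustration.

```python
def em_one_double_het(q, n_iter=50):
    """Run EM for one A1B1, A2B2 individual from q = (q_AA, q_AB, q_BA, q_BB)."""
    q_aa, q_ab, q_ba, q_bb = q
    for _ in range(n_iter):
        cis = q_aa * q_bb    # relative weight of the pairing A1A2 / B1B2
        trans = q_ab * q_ba  # relative weight of the pairing A1B2 / B1A2
        if cis + trans == 0:
            raise ValueError("starting value gives the data zero likelihood")
        # E-step: posterior probability of the A1A2 / B1B2 pairing.
        w = cis / (cis + trans)
        # M-step: two haplotypes in total, so each expected count is halved.
        q_aa = q_bb = w / 2
        q_ab = q_ba = (1 - w) / 2
    return (q_aa, q_ab, q_ba, q_bb)


for start in [(0.25, 0.25, 0.25, 0.25), (0.4, 0.1, 0.1, 0.4), (0.1, 0.4, 0.4, 0.1)]:
    print(start, "->", em_one_double_het(start))
```

Starting values with q_AA q_BB greater than, less than, and equal to q_AB q_BA cover the distinct cases.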

4. Here are the SNP genotypes of 6 individuals at 5 loci, with the alleles labelled 0 and 1. Each pair of digits is the genotype at a SNP locus, and each row is an individual.
(Note that these are exactly the same data you will run PHASE or fastPHASE on in Lab-1.)

10, 10, 10, 10, 10
00, 00, 00, 11, 10
00, 00, 00, 10, 00
10, 10, 10, 11, 00
11, 11, 11, 11, 00
00, 00, 00, 10, 11
Using Clark's algorithm (lecture notes 1.5.3), find an estimate of the haplotypes of each individual. Does the algorithm give a unique solution?
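
For checking a hand calculation, a minimal Python sketch of a Clark-style procedure is given below. It follows the usual outline (seed the haplotype list from individuals with at most one heterozygous locus, then resolve the remaining individuals against known haplotypes), but the details are assumptions made here and may differ from lecture notes 1.5.3: individuals and known haplotypes are scanned in their listed order and the first compatible haplotype is used. All function names are hypothetical.

```python
def parse_genotype(row):
    """Turn '10, 11, 00' into per-locus counts of allele 1, here (1, 2, 0)."""
    return tuple(sum(int(c) for c in pair.strip()) for pair in row.split(","))


def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = tuple(g - h for g, h in zip(genotype, hap))
    return other if all(a in (0, 1) for a in other) else None


def candidate_resolutions(genotype, known):
    """All distinct haplotype pairs for `genotype` that use a known haplotype."""
    pairs = set()
    for h in known:
        other = complement(genotype, h)
        if other is not None:
            pairs.add(frozenset((h, other)))
    return pairs


def clark(genotypes):
    known = []     # haplotypes found so far, in order of discovery
    resolved = {}  # individual index -> (haplotype 1, haplotype 2)
    # Step 1: at most one heterozygous locus means the phase is unambiguous.
    for i, g in enumerate(genotypes):
        if sum(1 for x in g if x == 1) <= 1:
            h1 = tuple(x // 2 for x in g)  # allele 0 at the heterozygous locus, if any
            h2 = complement(g, h1)
            resolved[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: resolve ambiguous individuals against the known list until stuck.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in known:
                other = complement(g, h)
                if other is not None:
                    resolved[i] = (h, other)  # first compatible haplotype wins
                    if other not in known:
                        known.append(other)
                    progress = True
                    break
    return resolved, known


ROWS = [
    "10, 10, 10, 10, 10",
    "00, 00, 00, 11, 10",
    "00, 00, 00, 10, 00",
    "10, 10, 10, 11, 00",
    "11, 11, 11, 11, 00",
    "00, 00, 00, 10, 11",
]
GENOTYPES = [parse_genotype(r) for r in ROWS]
resolved, known = clark(GENOTYPES)
for i in sorted(resolved):
    print(i + 1, resolved[i], len(candidate_resolutions(GENOTYPES[i], known)))
```

The last printed column counts how many distinct pairings involving a known haplotype are compatible with each individual, which bears on the uniqueness question; changing the scan order in step 2 is another way to probe it.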