1. The data for this question come
from a study by Dr. Arno Motulsky and coworkers, and
are published in Thompson et al. (1988; Am.J.Hum.Genet, 42, 113-124).
There are two loci, here labeled 1 and 2, each with two alleles labeled
A and B.
In a Caucasian sample of 205 individuals typed,
the counts of the two-locus phenotypes were
143, 35, 3, 17, 5, 0, 2, 0, and 0, respectively, for the nine types
A1A1,A2A2;
A1A1,A2B2;
A1A1,B2B2;
A1B1,A2A2;
A1B1,A2B2;
A1B1,B2B2;
B1B1,A2A2;
B1B1,A2B2;
B1B1,B2B2.
Use the EM algorithm to estimate the 4 haplotype frequencies.
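
One way to set up the computation is sketched below. It is a minimal sketch, not a definitive solution. It assumes random mating (Hardy-Weinberg proportions), so that only the double heterozygote A1B1,A2B2 is ambiguous about which haplotypes it carries. The function name em_two_locus, the uniform starting point, and the tolerance are illustrative choices, not part of the question.

```python
# Genotype counts in the order listed in the question:
# A1A1,A2A2; A1A1,A2B2; A1A1,B2B2; A1B1,A2A2; A1B1,A2B2;
# A1B1,B2B2; B1B1,A2A2; B1B1,A2B2; B1B1,B2B2
COUNTS = [143, 35, 3, 17, 5, 0, 2, 0, 0]


def em_two_locus(counts, start=(0.25, 0.25, 0.25, 0.25),
                 tol=1e-12, max_iter=10_000):
    """Return estimates (qAA, qAB, qBA, qBB) for haplotypes
    A1A2, A1B2, B1A2, B1B2, assuming random mating."""
    n11, n12, n13, n21, n22, n23, n31, n32, n33 = counts
    n_hap = 2 * sum(counts)  # two haplotypes per individual

    # Haplotype counts from the eight unambiguous genotypes.
    base = (
        2 * n11 + n12 + n21,  # A1A2
        n12 + 2 * n13 + n23,  # A1B2
        n21 + 2 * n31 + n32,  # B1A2
        n23 + n32 + 2 * n33,  # B1B2
    )

    q = tuple(start)
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote carries
        # A1A2/B1B2 rather than A1B2/B1A2.
        cis = q[0] * q[3]
        trans = q[1] * q[2]
        # Assumption: split evenly if both products are zero.
        w = cis / (cis + trans) if cis + trans > 0 else 0.5

        # M-step: expected haplotype counts divided by the total.
        new_q = (
            (base[0] + n22 * w) / n_hap,
            (base[1] + n22 * (1 - w)) / n_hap,
            (base[2] + n22 * (1 - w)) / n_hap,
            (base[3] + n22 * w) / n_hap,
        )
        converged = max(abs(a - b) for a, b in zip(new_q, q)) < tol
        q = new_q
        if converged:
            break
    return q


q_hat = em_two_locus(COUNTS)
print(dict(zip(["qAA", "qAB", "qBA", "qBB"], q_hat)))
```

A useful check on any implementation is that the allele frequencies at each locus do not change from one iteration to the next: qAA + qAB always equals the sample frequency of A1, and qAA + qBA the sample frequency of A2. Only the split of the five double heterozygotes between the two phases is updated.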
2. This very small example is unrealistic, but makes a point we will see in #3.
We have a sample of just one individual who is heterozygous at both of
two loci: A1B1, A2B2.
Denote the frequencies of the four haplotypes
A1A2,
A1B2,
B1A2, and
B1B2,
by
qAA,
qAB,
qBA, and
qBB.
Show that there are two maximum likelihood estimates (of equal likelihood):
qAA = qBB = 1/2,
qAB = qBA = 0 and
qAB = qBA = 1/2,
qAA = qBB = 0.
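
One possible line of argument is sketched below in LaTeX. It assumes random mating, so that the probability of a double heterozygote is the sum over its two possible phases. The symbol s is introduced here only for the sketch.

```latex
% Likelihood of the single observation A1B1, A2B2 under random mating
L(q) = 2\,q_{AA}\,q_{BB} + 2\,q_{AB}\,q_{BA},
\qquad q_{AA} + q_{AB} + q_{BA} + q_{BB} = 1 .

% Let s = q_{AA} + q_{BB}, so that q_{AB} + q_{BA} = 1 - s.
% For a fixed sum, a product is largest when the two factors are equal:
q_{AA}\,q_{BB} \le \frac{s^2}{4},
\qquad
q_{AB}\,q_{BA} \le \frac{(1-s)^2}{4} .

% Hence
L(q) \le \frac{s^2 + (1-s)^2}{2} = \frac{1 - 2s(1-s)}{2} \le \frac{1}{2},

% with equality only if s = 1 and q_{AA} = q_{BB} = 1/2,
% or s = 0 and q_{AB} = q_{BA} = 1/2.
```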
3. Suppose now we try to apply the EM algorithm to the data of question 2. What will the EM algorithm do? The answer depends on where you start, so work through some possible starting values. (Questions 2 and 3 together show the problem that can arise in trying to use the EM algorithm to estimate haplotype frequencies.)
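
A small numerical experiment can help in exploring starting values. The sketch below assumes random mating and a sample consisting of the single double heterozygote of question 2. The function name em_one_double_het and the three starting points are illustrative choices only.

```python
def em_one_double_het(start, n_iter=100):
    """Run n_iter EM updates for a sample of one A1B1,A2B2 individual.

    start is (qAA, qAB, qBA, qBB); the updated frequencies are returned
    in the same order.
    """
    q_aa, q_ab, q_ba, q_bb = start
    for _ in range(n_iter):
        cis = q_aa * q_bb    # phase A1A2 / B1B2
        trans = q_ab * q_ba  # phase A1B2 / B1A2
        if cis + trans == 0:
            raise ValueError("starting values give the data zero probability")
        # E-step: posterior probability of the A1A2 / B1B2 phase.
        w = cis / (cis + trans)
        # M-step: the individual carries two haplotypes, so each
        # haplotype of a phase receives half of that phase's weight.
        q_aa, q_bb = w / 2, w / 2
        q_ab, q_ba = (1 - w) / 2, (1 - w) / 2
    return q_aa, q_ab, q_ba, q_bb


for start in [(0.25, 0.25, 0.25, 0.25),
              (0.40, 0.10, 0.10, 0.40),
              (0.10, 0.40, 0.40, 0.10)]:
    print(start, "->", em_one_double_het(start))
```

In this sketch the update depends on the starting values only through the product qAA qBB compared with qAB qBA, which is a convenient way to organize the possibilities.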
4. Here are the SNP genotypes of 6 individuals at 5 loci, with the alleles
labelled 0 and 1. Each pair of
digits is the genotype at a SNP locus, and each row is an individual.
(Note these are the exact same data you will run PHASE or fastPHASE on
in Lab-1.)
10, 10, 10, 10, 10 |
00, 00, 00, 11, 10 |
00, 00, 00, 10, 00 |
10, 10, 10, 11, 00 |
11, 11, 11, 11, 00 |
00, 00, 00, 10, 11 |
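
As a preliminary look at these data, one might count how many loci each individual is heterozygous at, since an individual heterozygous at h >= 1 loci is consistent with 2^(h-1) distinct unordered haplotype pairs. The sketch below does only that. The name phase_ambiguity is hypothetical, and the trailing "|" on each data row is treated as a separator and dropped.

```python
# The six genotype rows from the question, one string per individual.
ROWS = [
    "10, 10, 10, 10, 10",
    "00, 00, 00, 11, 10",
    "00, 00, 00, 10, 00",
    "10, 10, 10, 11, 00",
    "11, 11, 11, 11, 00",
    "00, 00, 00, 10, 11",
]


def phase_ambiguity(row):
    """Return (number of heterozygous loci, number of compatible
    unordered haplotype pairs) for one individual."""
    genotypes = [g.strip() for g in row.split(",")]
    n_het = sum(1 for g in genotypes if g[0] != g[1])
    # With no heterozygous loci, or exactly one, the phase is known.
    n_pairs = 2 ** (n_het - 1) if n_het > 0 else 1
    return n_het, n_pairs


for i, row in enumerate(ROWS, start=1):
    n_het, n_pairs = phase_ambiguity(row)
    print(f"individual {i}: {n_het} heterozygous loci, "
          f"{n_pairs} compatible haplotype pair(s)")
```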