STAT 550 (DL) : Homework 3


Only 3 questions this week, as they are somewhat heavy, and we are still recovering from Lab 1. For this homework, refer to the third audio lecture (Chapter 1 part 3); sections 1.5 and 1.6 of the notes.

1. (Weir Exercise 2.2: data of S.Ohba, discussed by Yasuda and Kimura 1968) In a system with 4 alleles, O is recessive to S, M and F, which are codominant. In a sample of 1786 individuals, 1149 are SS or SO, 36 are MM or MO, 17 are FF or FO, 20 are OO, 336 are SM, 25 are MF, 203 are SF.
Assuming Hardy-Weinberg equilibrium, use the EM algorithm to estimate the frequencies of the four alleles.

2. EM for estimating a population inbreeding coefficient.
For two alleles, in fact we don't need EM, so we'll do the non-EM version first and then (in the next question) see how EM would do it.

Consider a diallelic locus with two codominant alleles A and B, with frequencies q and (1-q) as usual. A model for a population is that it is a mixture of two sorts of individuals: A proportion (1-f) have genotypes in HWE, while a proportion f have two copies of a single gene. Both groups of individuals have the same A allele frequency q. (Put like that it sounds stupid, but we'll see why it isn't a bad model -- think about f as an alternative way, with two alleles, of modelling the Wahlund effect.)

(i) Show that the population genotype probabilities for genotypes AA,AB, and BB are q2+fq(1-q), 2(1-f)q(1-q), and (1-q)2+fq(1-q)
(ii) A random sample of n individuals is taken from the population. Suppose there are n0, n1, and n2 individuals in the sample of types AA, AB, and BB respectively.
Show that the log-likelihood for the parameters q and f is (up to a const)
(n0+n1) log q +(n2+n1) log(1-q) + n1 log(1-f) +n0 log(q+f-fq) + n2 log(1-q+fq)
Clearly, this is not an easy expression to maximize, just by differentiating.

(iii) Trick: what is the expectation of (2n0+n1)?
Show that if q is estimated as (2n0+n1)/2n, f can then be chosen to make the expected genotypes frequencies exactly fit the observed. Deduce that q=(2n0+n1)/2n and f= (4 n0 n2 - n12)/ (2n0+n1)(2n2+n1) are the MLEs of q and f.

3. Now, we'll do the EM version for multiple alleles: see the book P.34 if having difficulties. Now we want to estimate allele frequencies (pi; i=1,...,) and f, where p1 + ...+ pk =1. Think of each individual as carrying a "flag" saying whether he has two copies of a single gene, or not. To make my HTML less painful, let ni be the number of homozygous AiAi individuals in the total sample of size n diploid individuals, and let mi be the number of heterozygotes carrying one Ai allele, so si = 2 ni +mi is the number of Ai alleles in the sample.

(i) Let Xi be the number of homozygous AiAi individuals in the necessarily homozygous "f" portion of the population, i=1,...,k. Shown that if Xi could be observed the log-likelihood would be
(2n1+m1 -X1) log q1 + ... + (2nk+mk - Xk) log qk + (X1+ ... + Xk) log f + (n - (X1+...+ Xk)) log(1-f)
Denoting si as above, and T=(X1+... +Xk), this reduces to
(s1-X1) log q1 + ... + (sk-Xk) log qk + T log f + (n-T) log(1-f)

(ii) Show that for given qi and f the expected value and given the data counts, the expected value of Xi is
f ni / (f+ qi -qif)
Hence show that an EM algorithm for this problem would be to iterate:
E-step: Xi = f ni / (f+qi -qi f), for i=1,...,k and T = X1 + ...+ Xk
M-step: qi = (si -Xi)/(2n - T), f = T/n

(iii) To show you understand, say what one round of the algorithm would look like for a data set with k=3, n=100, and and counts of A1A1, A1A2, A1A3, A2A2, A2A3 and A3A3 equal to 30, 20, 18, 10, 10 and 12 respectively.