STAT 550 (DL); Homework 3

STAT 550 (DL) : Homework 3

Only 3 questions this week, as they are somewhat heavy, and we are still recovering from Lab 1. For this homework, refer to the third audio lecture (Chapter 1 part 3); sections 1.5 and 1.6 of the notes.

1. (Weir Exercise 2.2: data of S.Ohba, discussed by Yasuda and Kimura 1968) In a system with 4 alleles, O is recessive to S, M and F, which are codominant. In a sample of 1786 individuals, 1149 are SS or SO, 36 are MM or MO, 17 are FF or FO, 20 are OO, 336 are SM, 25 are MF, 203 are SF.
Assuming Hardy-Weinberg equilibrium, use the EM algorithm to estimate the frequencies of the four alleles.

2. EM for estimating a population inbreeding coefficient.
For two alleles, in fact we don't need EM, so we'll do the non-EM version first and then (in the next question) see how EM would do it.

Consider a diallelic locus with two codominant alleles A and B, with frequencies q and (1-q) as usual. A model for a population is that it is a mixture of two sorts of individuals: A proportion (1-f) have genotypes in HWE, while a proportion f have two copies of a single gene. Both groups of individuals have the same A allele frequency q. (Put like that it sounds stupid, but we'll see why it isn't a bad model -- think about f as an alternative way, with two alleles, of modelling the Wahlund effect.)

(i) Show that the population genotype probabilities for genotypes AA,AB, and BB are q²+fq(1-q), 2(1-f)q(1-q), and (1-q)²+fq(1-q)
(ii) A random sample of n individuals is taken from the population. Suppose there are n₀, n₁, and n₂ individuals in the sample of types AA, AB, and BB respectively.
Show that the log-likelihood for the parameters q and f is (up to a const)
(n₀+n₁) log q +(n₂+n₁) log(1-q) + n₁ log(1-f) +n₀ log(q+f-fq) + n₂ log(1-q+fq)
Clearly, this is not an easy expression to maximize, just by differentiating.

(iii) Trick: what is the expectation of (2n₀+n₁)?
Show that if q is estimated as (2n₀+n₁)/2n, f can then be chosen to make the expected genotypes frequencies exactly fit the observed. Deduce that q=(2n₀+n₁)/2n and f= (4 n₀ n₂ - n₁²)/ (2n₀+n₁)(2n₂+n₁) are the MLEs of q and f.

3. Now, we'll do the EM version for multiple alleles: see the book P.34 if having difficulties. Now we want to estimate allele frequencies (p_i; i=1,...,) and f, where p₁ + ...+ p_k =1. Think of each individual as carrying a "flag" saying whether he has two copies of a single gene, or not. To make my HTML less painful, let n_i be the number of homozygous A_iA_i individuals in the total sample of size n diploid individuals, and let m_i be the number of heterozygotes carrying one A_i allele, so s_i = 2 n_i +m_i is the number of A_i alleles in the sample.

(i) Let X_i be the number of homozygous A_iA_i individuals in the necessarily homozygous "f" portion of the population, i=1,...,k. Shown that if X_i could be observed the log-likelihood would be
(2n₁+m₁ -X₁) log q₁ + ... + (2n_k+m_k - X_k) log q_k + (X₁+ ... + X_k) log f + (n - (X₁+...+ X_k)) log(1-f)
Denoting s_i as above, and T=(X₁+... +X_k), this reduces to
(s₁-X₁) log q₁ + ... + (s_k-X_k) log q_k + T log f + (n-T) log(1-f)

(ii) Show that for given q_i and f the expected value and given the data counts, the expected value of X_i is
f n_i / (f+ q_i -q_if)
Hence show that an EM algorithm for this problem would be to iterate:
E-step: X_i = f n_i / (f+q_i -q_i f), for i=1,...,k and T = X₁ + ...+ X_k
M-step: q_i = (s_i -X_i)/(2n - T), f = T/n

(iii) To show you understand, say what one round of the algorithm would look like for a data set with k=3, n=100, and and counts of A₁A₁, A₁A₂, A₁A₃, A₂A₂, A₂A₃ and A₃A₃ equal to 30, 20, 18, 10, 10 and 12 respectively.