Homework 4

Note: _ denotes a subscript, so x_i means x with subscript i.

1. Let x_1,...,x_n, x_{n+1}, x_{n+2} be independent given (f_1,...,f_k), with Pr(x_r = i) = f_i for i = 1,...,k. Let n_i be the number of x_1,...,x_n that are equal to i, and assume a Dirichlet(alpha_1,...,alpha_k) prior for (f_1,...,f_k).

a) Show that Pr(x_{n+1} = i | x_1,...,x_n) = (alpha_i + n_i) / (n + sum_j alpha_j), where the sum runs over j = 1,...,k.
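
As a numerical illustration of the expression in a) (not a proof), the short Python sketch below evaluates it for a small example. The function name `predictive` and the example values of alpha and the counts are assumptions made here for illustration only.

```python
def predictive(alpha, counts):
    """Evaluate (alpha_i + n_i) / (n + sum of all alpha) for every category i.

    alpha  : Dirichlet parameters alpha_1..alpha_k (hypothetical example values)
    counts : observed counts n_1..n_k, with n = sum(counts)
    """
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    n = sum(counts)
    total_alpha = sum(alpha)
    return [(a + c) / (n + total_alpha) for a, c in zip(alpha, counts)]


# Example: k = 3 categories, uniform prior, n = 3 observations.
probs = predictive([1.0, 1.0, 1.0], [2, 0, 1])
print(probs)
print(sum(probs))  # the probabilities over i = 1..k should sum to 1
```

Note that a category with n_i = 0 still receives positive probability, because alpha_i acts like a prior pseudo-count.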

b) Find a similar expression for Pr(x_{n+1} = i and x_{n+2} = j | x_1,...,x_n), both for the case i not equal to j and for the case i = j.

2. Let g_1,...,g_n be the (observed) genotypes of n individuals, let h_1,...,h_n be the (unobserved) haplotype pairs for the same individuals, and let f = (f_1,...,f_N) be the population frequencies of the N possible haplotypes. Assume a Dirichlet prior for f.

The following algorithm describes a Gibbs sampling strategy for sampling from the conditional distribution Pr(h_1,...,h_n | g_1,...,g_n): Start with some initial guess for h_1,...,h_n, and then iterate the following steps:

a) Choose i uniformly at random from 1,...,n.

b) Sample h_i from Pr(h_i | h_{-i}, g_1, ..., g_n), where h_{-i} denotes the set of current haplotype guesses, excluding individual i.

Using the results from 1 above, describe in detail how step b) would be performed.
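
The outer loop of the sampler can be sketched in Python as below. This is only scaffolding under stated assumptions: the names `gibbs_haplotypes` and `sample_h_i` are hypothetical, and `sample_h_i` is a caller-supplied placeholder for step b), which is the part the question asks you to work out. The genotypes are not passed explicitly here; the placeholder is assumed to have access to them.

```python
import random


def gibbs_haplotypes(h_init, sample_h_i, n_iter, seed=0):
    """Run n_iter random-scan Gibbs updates on a list of haplotype pairs.

    h_init     : initial guesses h_1..h_n (one entry per individual)
    sample_h_i : hypothetical placeholder for step b); called as
                 sample_h_i(i, h_minus_i, rng) and expected to return a
                 draw of h_i given the other individuals' current guesses
    """
    rng = random.Random(seed)
    h = list(h_init)                 # work on a copy of the initial guess
    n = len(h)
    for _ in range(n_iter):
        i = rng.randrange(n)         # step a): pick an individual uniformly
        h_minus_i = h[:i] + h[i + 1:]
        h[i] = sample_h_i(i, h_minus_i, rng)  # step b): resample h_i
    return h


# Usage with a dummy placeholder that ignores its inputs (illustration only).
dummy = lambda i, h_minus_i, rng: ("A", "B")
print(gibbs_haplotypes([("0", "0")] * 3, dummy, n_iter=10))
```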

3. Try applying Clark's algorithm (by hand) to estimate the haplotypes of each individual given the following genotype data on 5 SNPs in 6 individuals (each column is a SNP locus, each pair of rows is an individual). Does the algorithm give a unique solution?
00000
11111
00011
00010
00000
00010
01110
10010
11110
11110
00001
00011
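
As a starting point for the hand calculation, the Python sketch below groups the rows above into individuals and counts the heterozygous sites of each. It assumes that a column where an individual's two rows differ is a heterozygous site with unknown phase. Individuals with at most one heterozygous site have unambiguous haplotypes, which is where Clark's algorithm starts. The variable and function names are chosen here for illustration.

```python
# Genotype rows copied from the problem; each consecutive pair is one individual.
rows = """
00000
11111
00011
00010
00000
00010
01110
10010
11110
11110
00001
00011
""".split()

individuals = [(rows[k], rows[k + 1]) for k in range(0, len(rows), 2)]


def het_sites(pair):
    """Return the 1-based SNP positions where the two rows differ."""
    a, b = pair
    return [s + 1 for s in range(len(a)) if a[s] != b[s]]


het_counts = []
for idx, pair in enumerate(individuals, start=1):
    sites = het_sites(pair)
    het_counts.append(len(sites))
    # Zero or one heterozygous site means the haplotype pair is determined.
    status = "unambiguous" if len(sites) <= 1 else "ambiguous"
    print(f"individual {idx}: het sites {sites} -> {status}")
```

The ambiguous individuals are the ones that must be resolved using haplotypes already identified, and the order in which they are tried is what can make the result non-unique.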

4.

Instructions for using PHASE: