Stat550: 2014. Lab 1: A: EM using R and B: PHASE

This week's lab will include two problems.
Note that some write-up is requested for each.

A: EM algorithm for ABO frequencies, using R

The Statistics R software is installed on statgen.stat.washington.edu.
Alternatively, it is freely available for download
To invoke R, the command is R. To quit R, say q().
To get help in R, say help or ? with name of function or idea you want help on.

  • 1. The attached file abo_fit.r contains a function that implements the EM algorithm for the ABO blood group allele frequencies. The second file abo_source.r contains lines of R-code that you can use to call this function. Download these files, and cut and paste the commands from the first part of abo_source.r to estimate the ABO allele frequencies for the given sample which consists of (22, 14, 46, 18) individuals of the four phenotypes (A, AB, B, O) respectively. The estimated allele frequencies are stored in the vector my.fit. The values you should see in my.fit are 0.2008183, 0.3690472, and 0.4301345, for the frequencies of A, B and O alleles, respectively.
  • 2. Are these data a good fit to the model for the ABO blood type frequencies?
  • 3. Cut and paste the commands in the second part of the file abo_source.r to look at the variance of the estimator. This code generates 1000 samples from the fitted model, and re-estimates the allele frequencies. We can therefore look at the empirical variance of these estimates. We can plot them using hist(.), or find variances and covariances using var(.) and cov(.).
    Find estimates of the variances of the estimators of each of the three allele frequencies.
  • 4. Explore the variance-covaiance matrix of the three estimators. Is it singular? Why?

    B: PHASE Assignment

    The program PHASE is due to Matthew Stephens, and produces estimates of haplotypes of individuals given their genotypes at mutiple loci. It uses a model for similarities among haplotypes in a population, and samples phasings of the genotypic data under this model.
    PHASE is installed on statgen.stat.washington.edu.

    Alternatively, you may wish to try the more modern fastPHASE, which is also at statgen.stat.washington.edu.

    If you wish to download your own versions, PHASE and fastPHASE are available through Mathew Stephens page at Chicago, and there are also links at the StatGen Computing Info page.

    Here are the full instructions for the PHASE software, if you need them, or are interested in understanding the input format etc. (The PHASE instructions were still here as of April 11, 2013, but maybe no longer....)

    Here are the data on 6 individuals at 5 SNP loci that you applied Clark's algorithm to in Homework 3. Each row is one individual.
    10, 10, 10, 10, 10
    00, 00, 00, 11, 10
    00, 00, 00, 10, 00
    10, 10, 10, 11, 00
    11, 11, 11, 11, 00
    00, 00, 00, 10, 11

    Use the PHASE program to estimate the haplotype frequencies and haplotypes for each individual.
    The data set in the file phase.inp. is the same data as above, but in Matthew Stephens' format!

    To run PHASE:

  • 1. Go to the small PHASE example input data file phase.inp. Copy this PHASE example file phase.inp, into your working directory on statgen.stat -- or to wherever you want to run PHASE.
  • 2. run PHASE by typing, for example :
    % PHASE phase.inp phase.out
    The program prints some information to the screen and also produces several output files with names starting phase.out_... .
  • 3. examine the output in the file phase.out_pairs (which lists the most probable haplotype pairs for each individual, along with their estimated probabilities), and the file phase.out_freqs which gives estimates of the population frequencies of each haplotype.

    Write a couple of brief paragraphs explaining the output, and comparing your estimates with the estimates you got from Clark's algorithm in your Homework 3.