Stat550: Lab 3: PHASE and MORGAN/ibd_haplo

This week's lab will cover two programs.
Note that some write-up is requested for each program.

If using the Biostat Linux computers:
Remember that unless you have added the line "source ~statgen/.statgen.cshrc"
to your own .cshrc (or equivalent) file, you will need to give this command each time you log on in order to access the statgen programs.

A: PHASE Assignment

The program PHASE is due to Matthew Stephens, and produces estimates of the haplotypes of individuals given their genotypes at multiple loci. It uses a model for similarities among haplotypes in a population, and samples phasings of the genotypic data under this model.
PHASE is installed on the Biostat computers.

Alternatively, you may wish to try the more modern fastPHASE, which is available through Matthew Stephens' page at Chicago.

Here are the full instructions for the PHASE software, if you need them, or are interested in understanding the input format etc. (The PHASE instructions are still here as of April 25, 2012.)

Here are the data on 6 individuals at 5 SNP loci that you applied Clark's algorithm to in Homework 6. Each row is one individual.
10, 10, 10, 10, 10
00, 00, 00, 11, 10
00, 00, 00, 10, 00
10, 10, 10, 11, 00
11, 11, 11, 11, 00
00, 00, 00, 10, 11
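
To see why a program such as PHASE is needed, it can help to count how many haplotype pairs are consistent with each genotype. The following is a minimal sketch in Python, not part of PHASE itself. It assumes that "10" denotes a heterozygous locus and that "00" and "11" denote the two homozygotes; the function name compatible_pairs is hypothetical.

```python
from itertools import product

# The six genotypes from the table above, one list per individual.
GENOTYPES = [
    ["10", "10", "10", "10", "10"],
    ["00", "00", "00", "11", "10"],
    ["00", "00", "00", "10", "00"],
    ["10", "10", "10", "11", "00"],
    ["11", "11", "11", "11", "00"],
    ["00", "00", "00", "10", "11"],
]


def compatible_pairs(genotype):
    """Return every unordered haplotype pair consistent with a genotype.

    Assumption: each locus is a two-character string, and the two
    characters differ exactly when the locus is heterozygous.
    """
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    pairs = []
    # The first heterozygous locus is held fixed so that each unordered
    # pair is listed once; every other heterozygous locus may be swapped.
    for swaps in product([False, True], repeat=max(len(het) - 1, 0)):
        h1 = [g[0] for g in genotype]
        h2 = [g[1] for g in genotype]
        for locus, swap in zip(het[1:], swaps):
            if swap:
                h1[locus], h2[locus] = h2[locus], h1[locus]
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


for number, genotype in enumerate(GENOTYPES, start=1):
    print(number, len(compatible_pairs(genotype)))
```

An individual with k heterozygous loci has 2^(k-1) possible phasings when k is at least 1, so only individuals with two or more heterozygous loci are ambiguous.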

Use the PHASE program to estimate the haplotype frequencies and haplotypes for each individual.
The file phase.inp contains the same data as above, but in Matthew Stephens' input format.
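
For those curious about the input format, here is a rough Python sketch of how the table above might be rewritten in a PHASE-style layout. It assumes the layout is: number of individuals, number of loci, a line of locus types ("S" for each SNP), then for each individual an ID line followed by two lines of alleles. The "#1", "#2", ... labels and the function name to_phase_input are hypothetical. Treat the supplied phase.inp and the PHASE instructions as authoritative if they differ.

```python
def to_phase_input(genotypes):
    """Build a PHASE-style input string from two-character genotype codes.

    Assumptions (check against the supplied phase.inp): all loci are SNPs,
    the optional line of marker positions is omitted, and the order of the
    two alleles at a heterozygous locus is arbitrary because phase is unknown.
    """
    n_ind = len(genotypes)
    n_loci = len(genotypes[0])
    lines = [str(n_ind), str(n_loci), "S" * n_loci]
    for k, geno in enumerate(genotypes, start=1):
        lines.append(f"#{k}")                           # hypothetical ID label
        lines.append(" ".join(g[0] for g in geno))      # first allele at each locus
        lines.append(" ".join(g[1] for g in geno))      # second allele at each locus
    return "\n".join(lines) + "\n"


# The six individuals from the table above.
genotypes = [
    ["10", "10", "10", "10", "10"],
    ["00", "00", "00", "11", "10"],
    ["00", "00", "00", "10", "00"],
    ["10", "10", "10", "11", "00"],
    ["11", "11", "11", "11", "00"],
    ["00", "00", "00", "10", "11"],
]
print(to_phase_input(genotypes))
```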

To run PHASE:

  • 1. Copy the small PHASE example input file phase.inp into your b550 working directory on hercules -- or to wherever you want to run PHASE.
  • 2. Run PHASE by typing, for example:
    % PHASE phase.inp phase.out
    The program prints some information to the screen and also produces several output files with names beginning with phase.out_ .
  • 3. Examine the output in the file phase.out_pairs (which lists the most probable haplotype pairs for each individual, along with their estimated probabilities), and in the file phase.out_freqs, which gives estimates of the population frequency of each haplotype.

    Write a couple of brief paragraphs explaining the output, and comparing your estimates with the estimates you got from Clark's algorithm in your Homework 6.

B: MORGAN/ibd_haplo Assignment

    ibd_haplo is a relatively new MORGAN program, so it is not yet included in the online tutorial, and is not yet very well documented. It uses a hidden Markov model (HMM) to detect segments of ibd among sets of chromosomes. The program is primarily designed to look at the 15 states of ibd among the four chromosomes of two individuals. However, it can also be run on a pair of chromosomes: for example, the two chromosomes of an individual. The model is then essentially equivalent to that of the Leutenegger et al. (2003) paper discussed in class.

    The data file ibd_test1.markers gives marker genotypes at 2000 very closely linked SNPs for 8 individuals. The parameter file ibd_haplo.par specifies the input data file and gives a number of other parameters used in running ibd_haplo. Download these two files to wherever you are running MORGAN programs. Check that your system knows where to find the program by typing
    % which ibd_haplo

    You may now run ibd_haplo by saying
    % ibd_haplo haplo_pair.par > haplo_pair.out

    You will have some descriptive but not very useful output in the file haplo_pair.out. The useful output is in the file qibd_lab3.out (this file name was specified in the parameter file). This file contains about 16000 lines, one for each of the 2000 markers for each of the 8 individuals. Each line, apart from the header lines, consists of 4 numbers:
    Marker number, Marker position (in Mbp), ibd-probability, non-ibd probability

    Find some segments of inferred ibd between the two chromosomes of any of these 8 individuals -- that is, runs of high probabilities in the third column of the output. You should be able to find at least 3 segments in total. Write a brief paragraph, specifying the individual, and the marker numbers/positions of these segments.
    Hint: I did it quite crudely, by searching for the pattern of three spaces followed by 0.9, and then I also tried three spaces followed by 0.8 to find other less certain segments.
    ibd_haplo is still not so user friendly; it is a work in progress.
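
As a somewhat less crude alternative to the pattern search in the hint, the following Python sketch collects runs of consecutive markers whose ibd probability is at or above a threshold. It is only a sketch under assumptions about the output layout, which should be checked against your own qibd_lab3.out: each data line has exactly four whitespace-separated numeric fields, the marker number is written as an integer, and the marker number restarts when the output moves on to the next individual. The function name find_ibd_segments and the "block" counter are hypothetical.

```python
def find_ibd_segments(lines, threshold=0.9):
    """Return runs of markers whose ibd probability is >= threshold.

    Each result is (block, first_marker, first_pos, last_marker, last_pos),
    where block counts how many times the marker numbering has restarted
    (assumed here to correspond to successive individuals).
    """
    segments = []
    block = 0
    prev_marker = None
    current = None  # the run being built, if any
    for line in lines:
        fields = line.split()
        if len(fields) != 4:
            continue  # header or blank line
        try:
            marker = int(fields[0])
            pos = float(fields[1])
            p_ibd = float(fields[2])
            float(fields[3])  # non-ibd probability, parsed only as a check
        except ValueError:
            continue  # four fields, but not all numeric: treat as header
        if prev_marker is None or marker <= prev_marker:
            # Marker numbering restarted: close any open run, start a new block.
            if current is not None:
                segments.append(tuple(current))
                current = None
            block += 1
        prev_marker = marker
        if p_ibd >= threshold:
            if current is None:
                current = [block, marker, pos, marker, pos]
            else:
                current[3], current[4] = marker, pos
        elif current is not None:
            segments.append(tuple(current))
            current = None
    if current is not None:
        segments.append(tuple(current))
    return segments
```

To apply it to the real output, open the file and pass the file object, for example `with open("qibd_lab3.out") as f: print(find_ibd_segments(f))`. Rerunning with threshold=0.8 corresponds to the second, less stringent search in the hint.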