Note: The required parameter and pedigree files to run the examples below have been added to /datafiles/Lab4_auto on statgen.stat.washington.edu.
The goal of this week's lab is to run the program lm_auto which simulates inheritance and estimates ibd probabilities on specified individuals with an output very similar to that of ibddrop, but now the simulation is done conditional on marker and trait data.
However, in order to be able to run this and the final linkage analysis lab on your own pedigrees, we need first to create some data that is consistent with linkage etc. We do this using the MORGAN program markerdrop. Rather than give practice example, I will walk you through what I did in my example pedigree, so that you can do parallel things with yours.
In order to make m pedigree a bit more similar to yours, I have augmented my JV pedigree with two extra founders, making the spouses 2v6 and 2v4 into siblings, so that their offspring are double first cousins. The new pedigree file is at jvped_2014.ped and there is a picture of it here.
First you will need to modify your pedigree file, adding columns for trait data. In jvped_2014.ped I added 3 columns, In the first, there are no observed trait data (all values 0). In the second, the inbred individual fred is affected (value 2) and his two parents unaffected value 1). In the third both fred and a pair of bilateral relatives (in my file grandma and 3v3) are affected (2) and several other individuals including fred's two parents are unaffected (1).
The markerdrop parameter file markerdrop.par specifies that we will simulate marker data conditional on the trait data in this third trait column ("integer 4", as the first integer is gender). It then gives a genetic map for our markers and the location of the trait locus relative to the markers (just as in ibddrop) and also now gives allele frequencies for the five 4-allele markers (the same for each marker in this example). It gives the trait parameters as a fairly rare, almost (but not absolutely) recessive trait. and then the frequencies of the two alleles at the trait locus.
Finally it also give the pattern of observed marker data we want to simlate. In my example, I have specified that the seven individuals observed in that third trait column will be the ones observed, and they will all be observed for each of their two alleles and each of the 5 markers. That is why there are 10 "1"'s after each name.
Now we are ready to run markerdrop by saying
% markerdrop markerdrop.par > markerdrop.out
Note that to run markerdrop I need a seed file such as
sampler.seed
which I just continued from my Lab-3.
In output file markerdrop.out is a lot of info about what it thinks it simulated, but the important part is the seven lines of marker data right at the end. These are going to be my marker data for the rest of my analyses, and I cut and paste them into another file jvped_2014.markers which I am going to use in all the other programs in Lab-4 and Lab-5. (These data ended up a little weird in some respects -- any set of enough random numbers contain some a bit weird! -- But it seems to give OK results so we go with it.)
Markerdrop assignment:
As in ibddrop we need to decide what gametes we are going to score,
and I chose to do the four gametes of my bilateral relatives
(the double-first-cousins, grandma and 3v3), and also another
set of 4 consisting of the two of my inbred individual fred, with
one from each of grandma and 3v3. To check things were working
I first ran ibddrop using the parameter file
ibddrop_2014.par:
% ibddrop ibddrop_2014.par > ibddrop_2014.out
At the end of my output file ibddrop_2014.out are the two probability tables, each of (potentially) 15 states, for the two sets of 4 gametes.
Now on to the lm_auto program: let's look at the first parameter file lm_auto_1.par. (In fact the only difference between the three lm_auto_n.par parameter files is the "select trait n" statement, where n is 1, 2 or 3.)
First we tell it the input files in the usual way -- pedigree file, marker data file, and seed file. Then we specify the genetic maps, and now must also specify the allele frequencies at the marker loci and trait locus. For convenience, I have given the same 4 marker allele frequencies at each of markers 1,2,3,4,5, so I can condense it to one statement.
Then there is a section wich looks messy, but only because I decided to put in all three traits -- in our case the trait modes are all the same -- our rare, almost recessive -- but we use the idea of different traits to accommodate our three sets of different trait data. Then we specify our sets of proband gametes, just as we did for ibddrop.
Finally, there is some information on the MCMC, which you do not need to worry about. For those who care, the first three statements tell it how to get its initial pattern of inheritance, and how to get started. The last 5 statements tell it about how to do the MCMC. The multiple meiosis sampler is our better MCMC sampler; however, it takes longer, so I have reduced the number of MCMC scans to 10,000. Sample by scan means that it will update all the inheritance patterns in turn, before the next round of updating. The number of MC iterations is now the number of scans. Finally, our MCMC sampler is a mixture or two samplers, which we call L-sampler and M-sampler. Here we are asking it to use the L-sampler a random 20% of the time, and the remaining 80% will be M-sampler.
Now run lm_auto on this parameter file by typing
% lm_auto lm_auto_1.par > auto1.out
Unlike our previous programs, this is not instantaneous -- it could
take up to about 1 minute to run. Be patient!
Now look at the output file auto1.out. The first part should be very similar to your ibddrop output. It also gives various information about what it is doing, both with regard to its data and trait information -- note it notices that in this first example is has "no trait data", but it still produced ibd probabilities conditional on marker data.
There are various summaries/diagnosics on the MCMC, but we just focus on the last part -- the tables of ibd probabilities at each locus for each of our two gamete sets.
Note these probabilities are quite different from the prior probabilities produced by ibddrop. Some states are excluded (prob 0) at some loci by the marker data at that locus. Because the loci are linked, neighboring loci are also affected. Probabilities are quite different at the different markers, and generally the probabilty mass is concentrated into a few states (at least if your marker data are reasonably informative, as these are).
Now run
% lm_auto lm_auto_2.par > auto2.out
with output file
auto2.out, and
% lm_auto lm_auto_3.par > auto3.out
with output file
auto3.out.
Each time we add more trait data, the ibd probabilities change dramtically, especially at the trait locus, but also at nearby markers.