This one takes me longer to write and explain, than it will take you to do it!
This week we have two MORGAN programs, ibddrop and lm_auto. Both these programs evaluate multilocus identity-by-descent probabilities, and both do it by Monte Carlo simulation. As these are more complicated programs, this time you will just run the already set-up examples and examine the output, rather than trying to create you own files to run on your own pedigree (unless you want to!).
The tutorial examples are again in the IBD subdirectory of your MORGAN_Examples directory. In fact, in the MORGAN Tutorial examples, the same example pedigree file you used for kin is used. This consists of two replicates of the JV pedigree. You can find out more about ibddrop in Chapter 7 of the Tutorial, and about lm_auto in the first half of Chapter 9.
However, the tutorial examples as given are not scientifically interesting, so instead we will run somewhat similar examples. Download the following four files, to the IBD subdirectory of your MORGAN_Examples directory.
Since we are doing Monte Carlo, we need a seed for the random number generator -- our particular generator uses a pair of numbers in hexadecimal -- that's why they look weird. We could use (fixed) default seeds, but that is boring: we get the same answers every time, and don't see the Monte Carlo variation. Some random number generators use a random seed based on the exact time, but we don't want to do that -- in developing software especially, it is sometimes very useful to be able to rerun from the same seed to check things. We therefore read a seed from a seed file, using a parameter statement. If you look in the parameter files you will find the statement input seed file . The program will read its starting seed from the specified file. There is also an output overwrite seed file statement. If this statement in included, at the end the program outputs the seed to the named output seed file, overwriting what is there. Then when another run is started, the program can use this new seed. NOTE: This should also be how the tutorial examples of ibddrop and lm_auto run, but right now they do not output seeds.
Now we are ready to start on the first ibddrop example. Look at the parameter file ibddrop1.par First we specify the pedigree file and seed files as described above. Next we tell it that it is to simulate "markers 5 trait 1. (Remember only the first 4 letters of each key word are significant.) In fact, to ibddrop, this just means 6 loci, since there are no data, but it is convenient to use the same statements that we will use for lm_auto. Next we specify the marker map, in terms of recombination fractions between adjacent markers. Since we don't give the map a gender, it assumes it is both the` male and the female marker map. Then we tell it where to put the trait locus relative to the markers: "marker 2 recomb frac .112". This means put the trait locus to the right of marker 2, at recombination fraction .112 to marker 2. (If we wanted it to the left of all the markers, we would say "marker 0 reco frac 0.112", and in this special case this would mean the recombination fraction to marker 1 is 0.112.) In fact, I have given a very slightly different recombination fraction for the male and female meioses, just to show we can!
Next, we have to tell ibddrop what ibd patterns to score. In this version, we give it some sets of haplotypes, and it will score all the possible patterns of ibd among them, scoring locus by locus. (This will probably be clearer when you look at the output.) We specify a haplotype by the ID "name" of the individual, and by a 0 (for maternal) or 1 (for paternal) indicator. Thus grandma 0 3v3 1 means score the ibd between the maternal haplotype of grandma and the paternal haplotype of 3v1. In this example, we are actually scoring the same thing on each of the three copies of the JV pedigree -- ibd between these two haplotypes, and then ibd among the 4 haplotypes (recall, there will be up to 15 possible patterns among 4 ordered haplotypes) consisting of the two of the final individual and these two. Finally, we tell ibddrop how many realizations to do: that is, how many times it will simulate descent at these 6 linked loci, down the pedigree. The number 40000 specified here does not take long, and is enough to get good estimates.
Now run ibddrop by typing
% ibddrop ibddrop1.par > ibd1.out
This may give you some warning messages to the screen -- maybe about seeds,
but will send the main output to ibd1.out. It generates quite a
bit of output, so it is probably easier to look at it in a file.
Now look at the output. A lot of MORGAN output is involved in it telling you what it understood you to tell it to do. This can be a bit tedious, but is well worth checking!! Most of the silly errors we have made in running MORGAN were because there was something a little bit wrong with a parameter file, and it did what the parameter file said.
First it prints little pictures of the maps, showing the markers and traits. It converts recombination fractions to centiMorgans (we will meet this very soon). Since we told it a slightly different male from female map it prints both. Next it tells us what haplotypes (it calls them proband gametes) we told it to score, on each pedigree component -- and it will check that the specified individuals are indeed in the right component pedigree. Then it reminds us that we asked for 40000 realizations (it calls them MC iterations). This is the end of its checking of parameter statements -- and it tells us so!
Now it opens the pedigree file, and does the by now familiar checking and summary of the pedigree. Then it reopens it to start its simulations: this is because of the way pedcheck works, producing a reordered pedigree for other MORGAN programs, if necessary. Next it tells us how many meioses it will simulate: 18 on each copy of the JV pedigree. You may find this a bit odd: there are 10 non-founders, each with a maternal and paternal meiosis. The reason there are only 18, not 20, is because it does not bother with one from any founder who has only one kid. (We may discuss this, some time.) It tells us its seeds (useful if we ever want to rerun the exact same thing), its map again (this time in recombination fractions, which is what it actually uses to simulate, and since it has found we asked for sets of 2 and of 4 haplotypes, it reminds us that for 2 there are 2 "patterns" (ibd or not) while for 4 there are 15.
Then, it does its stuff, and prints out the results, just counting up what it has simulated, locus by locus. The pair of individuals are cousins -- the true probability of ibd at each locus is 0.25. You will see some Monte Carlo variation, but you should see that at every locus the probability of ibd (the pattern 1 1) is about 0.25, and of non-ibd (the pattern 1 2) is about 0.75. Now look at the set of 15 patterns among the four haplotypes. Remember the inbreeding coefficient of the final individual is 7/64 -- how would you find the Monte Carlo estimate of this from this table? Since each component pedigree is the same, and the same haplotypes are scored, you should see the same (up to Monte Carlo variation) results for each component. Do you see more/less similarity between the different components, or between the loci in a given component? Which would you expect to be more similar? Why? Which loci do you expect to have most similar results? Why?
Assignment part 1
Write a brief paragraph giving your answers to these questions, and/or summarizing anything else you see interesting in the results.
This version of ibddrop shows a different way of scoring the multilocus ibd. In A1 we scored locus-by-locus, and so ended up with probabilities that were the same (on average) at each locus, and didn't really show the dependence between linked loci.
Look at the parameter file ibddrop2.par. Down to the scoring the information is the same, although the marker map information looks a bit different. I gave the male map in (Haldane) centiMorgans rather than recombination fractions, but if fact it's the same map (22.3 cM is a recombination fraction of 0.18, and 11.15 cM is recombination fraction 0.1). The request for 40000 Monte Carlo realizations at the bottom of the file is also as before.
The difference is in the scoring patterns: now we request the program to score over overlapping windows of 3 consecutive loci. Since our loci are ordered on the chromosome as (M1,M2,T,M3,M4,M5) these windows will be (M1,M2,T), (M2,T,M3), (T,M3,M4), and (M3,M4,M5). What it will score will be Yes(1)/No(0) events, and so simplest is just to give it pairs of haplotypes (Proband gametes) and ask it to score ibd (Yes=1) or not (No =0). So we give it five such pairs which are much the same as for A1: the two of the final individual, and a pair of cousin grandparents, but now only 2 at a time.
This will be clearer on looking at the output, so
now run ibddrop on this second parameter file by typing
% ibddrop ibddrop2.par > ibd2.out
The output
consists of a list of possible ibd patterns for the three loci and four columns
of relative frequencies. Remember that we are looking at 3 loci at a time. The
ibd pattern 0 1 0, for example, is the event that the individual's two
alleles
are not ibd at the first locus, are ibd at the second, and aren't ibd at the
third, where first, second, and third refer to each set of 3 loci in a
window.
Here are some things you can think about to see whether the output makes sense:
Assignment part 2
Make some of the comparisons suggested in the previous paragraph and report on them, in a brief paragraph. Say clearly which numbers you expect to be equal, and why.
Now we are ready to look at lm_auto. We will use the same loci and scoring of ibd as in A1, but now our realizations are MCMC realizations that are conditional on trait and marker data.
Recall
the trait data on the three copies of the JV pedigree is the fifth data
column in the pedigree
file -- see the description of ped45 above.
So look again at ped45.
The code 0 means unobserved, so on the
first JV pedigree there are no trait data (but there will be marker data).
The second JV pedigree has the final individual fred with phenotype 4.
This is our code for being affected with a recessive trait.
The third JV pedigree has this same phenotype for the final individual
(now a female 5w1), but
additionally I have specified two of her great-grand-parents to be known
carriers of this recessive allele (code = 3). (In real life, we'd
be unlikely to know this, but we might be doing a "what if" exercise.)
Now look at
the parameter file lm_auto.par.
The first part is much as before, except that instead of telling it to
simulate 5 markers and 1 trait, we tell it to select
for analysis all markers and 1 trait. We will give it data on
5 markers, so that is really no change. Next we tell it about the
markers, giving it the same marker map as before.
Each marker has 4 alleles, and the marker allele
frequencies for 4 of the markers are the same. Marker 3 has slightly
different frequencies for its 4 alleles, and allele 2 for the traits
(the disease allele) has frequency 0.05. (So our recessive
trait is not very rare.)
Next we give it the marker data. In this rather artificial example we have not only the same marker data on each of the three JV pedigrees, but the same data at each of the five linked markers! There are just 5 individuals observed on each JV pedigree -- the final individual, one parent and three grandparents. The final individual is homozygous 3 3 and each of his typed ancestors carries one 3 allele. It is therefore interesting which of these type-3 alleles are ibd. In fact, these marker data are the same as the example in the book, and also the one we discussed in class at the end of chapter 2 of the notes. Below the marker data are the haplotypes (proband gametes) to be scored, just as before.
Finally, there is some information on the MCMC, which you do not need to worry about. For those who care, the first three statements tell it how to get its initial pattern of inheritance, and how to get started. The last 4 statements tell it about how to do the MCMC. Sample by scan means that it will update all the inheritance patterns in turn, before the next round of updating. The number of MC iterations is now the number of scans. Because it may take a while, we ask it to let us know how it is doing every 10000 scans. Finally, our MCMC sampler is a mixture or two samplers, which we call L-sampler and M-sampler. Here we are asking it to use the L-sampler a random 20% of the time, and the remaining 80% will be M-sampler.
Now run lm_auto on this parameter file by typing
% lm_auto lm_auto.par ped "ped45" > lm_auto.out
On abacus, this took about 4 minutes to run. Be patient!
If it is really taking too long, you may reduce the number of MC
iterations from 40000 to (say) 10000, but your results will not be so
good.
Now look at the output file lm_auto.out. The first part should be very similar to your ibd1.out output. For the results of the MCMC, focus just on the tables of IBD probabilities, scored locus-by-locus just as in A1.
Assignment part 3
Your assignment here is to explain some of these results in relation to
the no-data ibddrop results, and to the three levels of data
we give in lm_auto. Be brief in your answers -- a brief paragraph for
each of 1,2,3 is enough.