lab4

Stat550 (DL): Lab 4: MORGAN programs ibddrop and lm_auto

This one takes me longer to write and explain, than it will take you to do it!

This week we have two MORGAN programs, ibddrop and lm_auto. Both these programs evaluate multilocus identity-by-descent probabilities, and both do it by Monte Carlo simulation. As these are more complicated programs, this time you will just run the already set-up examples and examine the output, rather than trying to create you own files to run on your own pedigree (unless you want to!).

The tutorial examples are again in the IBD subdirectory of your MORGAN_Examples directory. In fact, in the MORGAN Tutorial examples, the same example pedigree file you used for kin is used. This consists of two replicates of the JV pedigree. You can find out more about ibddrop in Chapter 7 of the Tutorial, and about lm_auto in the first half of Chapter 9.

However, the tutorial examples as given are not scientifically interesting, so instead we will run somewhat similar examples. Download the following four files, to the IBD subdirectory of your MORGAN_Examples directory.

The pedigree file ped45,
the first ibddrop parameter file ibddrop1.par,
the second ibddrop parameter file ibddrop2.par,
and an lm_auto parameter file lm_auto.par.

You do not have to keep the same file names when you download them, but those are the names I will use in describing them, so you will find it easier if you do.

Random Seeds in MORGAN

In addition to the above four files, the programs will use the file sampler.seeds in the main MORGAN_Examples directory (one level up from IBD). Leave it there: the parameter files are set to find it there.. When you run the programs you will get a warning that this seed file will be overwritten: this is OK.

Since we are doing Monte Carlo, we need a seed for the random number generator -- our particular generator uses a pair of numbers in hexadecimal -- that's why they look weird. We could use (fixed) default seeds, but that is boring: we get the same answers every time, and don't see the Monte Carlo variation. Some random number generators use a random seed based on the exact time, but we don't want to do that -- in developing software especially, it is sometimes very useful to be able to rerun from the same seed to check things. We therefore read a seed from a seed file, using a parameter statement. If you look in the parameter files you will find the statement input seed file . The program will read its starting seed from the specified file. There is also an output overwrite seed file statement. If this statement in included, at the end the program outputs the seed to the named output seed file, overwriting what is there. Then when another run is started, the program can use this new seed. NOTE: This should also be how the tutorial examples of ibddrop and lm_auto run, but right now they do not output seeds.

The Pedigree file

Now ibddrop just simulates gene descent at linked loci down a pedigree, which is easy to do. The program lm_auto uses MCMC to simulate the same thing, but conditional on trait and/or marker data. So now let us look at the pedigree file ped45 -- all three parameter files specify this same pedigree file. The comments up the top say it all. It is just 3 copies of the 15-member JV pedigree, where I changed a few names to make it more interesting (??). The fourth column is gender, as usual. The fifth column, which we have ignored up to now, will still be ignored by ibddrop. For lm_auto this is the default place for the trait phenotypic data. For now, you can ignore it also: it will be described when we get to lm_auto below.

A1: First version of ibddrop

Now we are ready to start on the first ibddrop example. Look at the parameter file ibddrop1.par First we specify the pedigree file and seed files as described above. Next we tell it that it is to simulate "markers 5 trait 1. (Remember only the first 4 letters of each key word are significant.) In fact, to ibddrop, this just means 6 loci, since there are no data, but it is convenient to use the same statements that we will use for lm_auto. Next we specify the marker map, in terms of recombination fractions between adjacent markers. Since we don't give the map a gender, it assumes it is both the` male and the female marker map. Then we tell it where to put the trait locus relative to the markers: "marker 2 recomb frac .112". This means put the trait locus to the right of marker 2, at recombination fraction .112 to marker 2. (If we wanted it to the left of all the markers, we would say "marker 0 reco frac 0.112", and in this special case this would mean the recombination fraction to marker 1 is 0.112.) In fact, I have given a very slightly different recombination fraction for the male and female meioses, just to show we can!

Next, we have to tell ibddrop what ibd patterns to score. In this version, we give it some sets of haplotypes, and it will score all the possible patterns of ibd among them, scoring locus by locus. (This will probably be clearer when you look at the output.) We specify a haplotype by the ID "name" of the individual, and by a 0 (for maternal) or 1 (for paternal) indicator. Thus grandma 0 3v3 1 means score the ibd between the maternal haplotype of grandma and the paternal haplotype of 3v1. In this example, we are actually scoring the same thing on each of the three copies of the JV pedigree -- ibd between these two haplotypes, and then ibd among the 4 haplotypes (recall, there will be up to 15 possible patterns among 4 ordered haplotypes) consisting of the two of the final individual and these two. Finally, we tell ibddrop how many realizations to do: that is, how many times it will simulate descent at these 6 linked loci, down the pedigree. The number 40000 specified here does not take long, and is enough to get good estimates.

Now run ibddrop by typing % ibddrop ibddrop1.par > ibd1.out
This may give you some warning messages to the screen -- maybe about seeds, but will send the main output to ibd1.out. It generates quite a bit of output, so it is probably easier to look at it in a file.

Now look at the output. A lot of MORGAN output is involved in it telling you what it understood you to tell it to do. This can be a bit tedious, but is well worth checking!! Most of the silly errors we have made in running MORGAN were because there was something a little bit wrong with a parameter file, and it did what the parameter file said.

First it prints little pictures of the maps, showing the markers and traits. It converts recombination fractions to centiMorgans (we will meet this very soon). Since we told it a slightly different male from female map it prints both. Next it tells us what haplotypes (it calls them proband gametes) we told it to score, on each pedigree component -- and it will check that the specified individuals are indeed in the right component pedigree. Then it reminds us that we asked for 40000 realizations (it calls them MC iterations). This is the end of its checking of parameter statements -- and it tells us so!

Now it opens the pedigree file, and does the by now familiar checking and summary of the pedigree. Then it reopens it to start its simulations: this is because of the way pedcheck works, producing a reordered pedigree for other MORGAN programs, if necessary. Next it tells us how many meioses it will simulate: 18 on each copy of the JV pedigree. You may find this a bit odd: there are 10 non-founders, each with a maternal and paternal meiosis. The reason there are only 18, not 20, is because it does not bother with one from any founder who has only one kid. (We may discuss this, some time.) It tells us its seeds (useful if we ever want to rerun the exact same thing), its map again (this time in recombination fractions, which is what it actually uses to simulate, and since it has found we asked for sets of 2 and of 4 haplotypes, it reminds us that for 2 there are 2 "patterns" (ibd or not) while for 4 there are 15.

Then, it does its stuff, and prints out the results, just counting up what it has simulated, locus by locus. The pair of individuals are cousins -- the true probability of ibd at each locus is 0.25. You will see some Monte Carlo variation, but you should see that at every locus the probability of ibd (the pattern 1 1) is about 0.25, and of non-ibd (the pattern 1 2) is about 0.75. Now look at the set of 15 patterns among the four haplotypes. Remember the inbreeding coefficient of the final individual is 7/64 -- how would you find the Monte Carlo estimate of this from this table? Since each component pedigree is the same, and the same haplotypes are scored, you should see the same (up to Monte Carlo variation) results for each component. Do you see more/less similarity between the different components, or between the loci in a given component? Which would you expect to be more similar? Why? Which loci do you expect to have most similar results? Why?

Assignment part 1

Write a brief paragraph giving your answers to these questions, and/or summarizing anything else you see interesting in the results.

A2: ibddrop, version 2

This version of ibddrop shows a different way of scoring the multilocus ibd. In A1 we scored locus-by-locus, and so ended up with probabilities that were the same (on average) at each locus, and didn't really show the dependence between linked loci.

Look at the parameter file ibddrop2.par. Down to the scoring the information is the same, although the marker map information looks a bit different. I gave the male map in (Haldane) centiMorgans rather than recombination fractions, but if fact it's the same map (22.3 cM is a recombination fraction of 0.18, and 11.15 cM is recombination fraction 0.1). The request for 40000 Monte Carlo realizations at the bottom of the file is also as before.

The difference is in the scoring patterns: now we request the program to score over overlapping windows of 3 consecutive loci. Since our loci are ordered on the chromosome as (M1,M2,T,M3,M4,M5) these windows will be (M1,M2,T), (M2,T,M3), (T,M3,M4), and (M3,M4,M5). What it will score will be Yes(1)/No(0) events, and so simplest is just to give it pairs of haplotypes (Proband gametes) and ask it to score ibd (Yes=1) or not (No =0). So we give it five such pairs which are much the same as for A1: the two of the final individual, and a pair of cousin grandparents, but now only 2 at a time.

This will be clearer on looking at the output, so now run ibddrop on this second parameter file by typing
% ibddrop ibddrop2.par > ibd2.out

The output consists of a list of possible ibd patterns for the three loci and four columns of relative frequencies. Remember that we are looking at 3 loci at a time. The ibd pattern 0 1 0, for example, is the event that the individual's two alleles are not ibd at the first locus, are ibd at the second, and aren't ibd at the third, where first, second, and third refer to each set of 3 loci in a window.

Here are some things you can think about to see whether the output makes sense:

1. Window 4 consists of markers 3,4 and 5, which are equally spaced. This should mean that, for this final column of relative frequencies, there should be symmetry between patterns 001 & 100, 011 & 110.

2. Is the output from this program consistent with the two-locus inbreeding probabilities you computed using kin in Lab 3? For example, look at the first column. Here, the recombination fraction between the first two loci is 0.18. For the simulations that ibddropgenerated, the relative frequency of fred being ibd at both of the first two loci is Pr(110) + Pr(111) = 0.0346 (for my simulations). According to kin, the probability should be 0.03455. That looks good to me. You could do the same sort of analysis to check the results of your simulations against the two-locus kinship/inbreeding from kin for intervals with recombination fraction 0.18 or 0.1.

3. One more thing: Note that, in the absence of interference, one interval with r = 0.18 is the same as two adjacent intervals with r = 0.1. In column 4, Pr(101) + Pr(111) = Pr( both ends of an interval with r = 0.18 are ibd). Also, in column 1, Pr(110) + Pr(111) = Pr( both ends of an interval with r = 0.18 are ibd). My simulations give pretty good agreement between these.

Assignment part 2

Make some of the comparisons suggested in the previous paragraph and report on them, in a brief paragraph. Say clearly which numbers you expect to be equal, and why.

B: lm_auto

Now we are ready to look at lm_auto. We will use the same loci and scoring of ibd as in A1, but now our realizations are MCMC realizations that are conditional on trait and marker data.

Recall the trait data on the three copies of the JV pedigree is the fifth data column in the pedigree file -- see the description of ped45 above.
So look again at ped45. The code 0 means unobserved, so on the first JV pedigree there are no trait data (but there will be marker data). The second JV pedigree has the final individual fred with phenotype 4. This is our code for being affected with a recessive trait. The third JV pedigree has this same phenotype for the final individual (now a female 5w1), but additionally I have specified two of her great-grand-parents to be known carriers of this recessive allele (code = 3). (In real life, we'd be unlikely to know this, but we might be doing a "what if" exercise.)

Now look at the parameter file lm_auto.par.
The first part is much as before, except that instead of telling it to simulate 5 markers and 1 trait, we tell it to select for analysis all markers and 1 trait. We will give it data on 5 markers, so that is really no change. Next we tell it about the markers, giving it the same marker map as before. Each marker has 4 alleles, and the marker allele frequencies for 4 of the markers are the same. Marker 3 has slightly different frequencies for its 4 alleles, and allele 2 for the traits (the disease allele) has frequency 0.05. (So our recessive trait is not very rare.)

Next we give it the marker data. In this rather artificial example we have not only the same marker data on each of the three JV pedigrees, but the same data at each of the five linked markers! There are just 5 individuals observed on each JV pedigree -- the final individual, one parent and three grandparents. The final individual is homozygous 3 3 and each of his typed ancestors carries one 3 allele. It is therefore interesting which of these type-3 alleles are ibd. In fact, these marker data are the same as the example in the book, and also the one we discussed in class at the end of chapter 2 of the notes. Below the marker data are the haplotypes (proband gametes) to be scored, just as before.

Finally, there is some information on the MCMC, which you do not need to worry about. For those who care, the first three statements tell it how to get its initial pattern of inheritance, and how to get started. The last 4 statements tell it about how to do the MCMC. Sample by scan means that it will update all the inheritance patterns in turn, before the next round of updating. The number of MC iterations is now the number of scans. Because it may take a while, we ask it to let us know how it is doing every 10000 scans. Finally, our MCMC sampler is a mixture or two samplers, which we call L-sampler and M-sampler. Here we are asking it to use the L-sampler a random 20% of the time, and the remaining 80% will be M-sampler.

Now run lm_auto on this parameter file by typing % lm_auto lm_auto.par ped "ped45" > lm_auto.out
On abacus, this took about 4 minutes to run. Be patient! If it is really taking too long, you may reduce the number of MC iterations from 40000 to (say) 10000, but your results will not be so good.

Now look at the output file lm_auto.out. The first part should be very similar to your ibd1.out output. For the results of the MCMC, focus just on the tables of IBD probabilities, scored locus-by-locus just as in A1.

Assignment part 3

Your assignment here is to explain some of these results in relation to the no-data ibddrop results, and to the three levels of data we give in lm_auto. Be brief in your answers -- a brief paragraph for each of 1,2,3 is enough.

1. The first component has no trait data, only the marker data.
(i) Note the haplotypes we are scoring all can carry a 3 allele. What does this do to the IBD probabilities? Is this as you would expect? Why?
(ii) In ibddrop all loci looked the same (up to Monte Carlo variation). Now, even though markers 1,2,4 and 5 have the same data and the same allele frequencies, you should see more estimated IBD at 2 and 4 than at 1 and 5. Why?

2. Now in the second JV pedigree, we have added the information that fred is affected with this recessive trait. What does this do to the IBD probabilities at the trait locus? What does this do to the probability that fred has two IBD haplotypes at the trait locus? And at neighboring markers? What, in general, happens to the IBD probabilities for these markers linked to the trait? Why?

3. In this final JV pedigree, we pretend we know that two specific great-grandparents carry the recessive disease allele, as well as the final individual being affected. Look again at ped45 and note how the haplotypes we are scoring related to these great-grandparents 2w1 and 2w2. How are the IBD probabilities affected by adding these additional trait data? Is this as you would expect? Why? Which markers are most/least affected by adding trait data? Why?