Stat550: Lab 5: MORGAN program lm_linkage

For the last lab we have another MORGAN program, which uses MCMC to estimate lod scores across a chromosome of markers. This is the program lm_linkage. You can find out more about lm_linkage in the first part of Chapter 11 of the MORGAN Tutorial.

Note: The required parameter and pedigree files to run the examples below have been added to /datafiles/Lab4_auto on statgen.stat.washington.edu.

We will use the same example used for lm_auto in Lab-4. If you were unlucky and
(i) Your affected inbred individual was homozygous at neither marker-2 nor marker-3, and/or
(ii) Your pair of affected bilateral relatives had the same genotype at neither marker-2 nor marker-3
you may want to rerun markerdrop to see if you can have better luck.
It may take several tries; you are basically hoping to get data where your affected individuals are aa.
However, the frequency of aa in the population is 0.0001, and most people are AA at the trait locus.
The aa people are almost certainly affected, but the AA people are affected with probability 0.0001.
The inbreeding and relatedness of the affected individuals increases our chances of finding them aa, but ...
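
A rough calculation shows why luck is involved. The sketch below is only illustrative: it takes the two numbers quoted above from the text, and additionally assumes Hardy-Weinberg proportions, a penetrance of exactly 1 for aa, and that Aa individuals behave like AA (a fully recessive trait). None of these extra assumptions is stated in this handout.

```python
import math

# Values quoted in the text.
FREQ_AA_HOMOZ = 0.0001    # population frequency of genotype aa
PEN_OTHER = 0.0001        # P(affected | AA)

# Assumptions (not from the handout): penetrance 1.0 for aa,
# and Aa has the same penetrance as AA.
PEN_AA_HOMOZ = 1.0

def prob_aa_given_affected(freq_aa, pen_aa, pen_other):
    """Bayes' rule: P(genotype aa | affected) for a random, non-inbred person."""
    affected_and_aa = freq_aa * pen_aa
    affected_and_other = (1.0 - freq_aa) * pen_other
    return affected_and_aa / (affected_and_aa + affected_and_other)

# Under Hardy-Weinberg, freq(aa) = q^2, so q is its square root.
q = math.sqrt(FREQ_AA_HOMOZ)
post = prob_aa_given_affected(FREQ_AA_HOMOZ, PEN_AA_HOMOZ, PEN_OTHER)

print(f"frequency of allele a: {q:.2f}")
print(f"P(aa | affected), unrelated individual: {post:.3f}")
```

Under these assumptions an unrelated affected individual is aa only about half the time, which is why the inbreeding and relatedness in the pedigree matter.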

Recall that I was lucky with my marker data:

fred is homozygous at markers 1, 2, and 3.
grandma and 3v3 have the same genotypes at markers 2, 3 and 4.
(Recall markerdrop outputs its genotypes as phased, so individuals have the same genotype if they have the same two alleles, either way round.)
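
If you want to check conditions (i) and (ii) on your own markerdrop output without reading genotypes by eye, something like the following sketch may help. The genotype values here are made up for illustration, and reading them out of your marker file is left to you; the function names are hypothetical helpers, not part of MORGAN.

```python
def same_genotype(g1, g2):
    """True if two phased genotypes hold the same two alleles, either way round."""
    return sorted(g1) == sorted(g2)

def is_homozygous(g):
    """True if both alleles of the genotype are identical."""
    return g[0] == g[1]

# Hypothetical phased genotypes at markers 1..5 (made-up values).
inbred = [(2, 2), (2, 2), (1, 1), (3, 1), (1, 4)]
rel_a = [(1, 3), (2, 4), (1, 3), (2, 1), (4, 4)]
rel_b = [(3, 2), (4, 2), (3, 1), (1, 2), (1, 4)]

FLANKING = (2, 3)  # marker-2 and marker-3, numbered from 1

# Condition (i): inbred individual homozygous at marker-2 or marker-3.
homoz_ok = any(is_homozygous(inbred[m - 1]) for m in FLANKING)
# Condition (ii): the two relatives share a genotype at marker-2 or marker-3.
shared_ok = any(same_genotype(rel_a[m - 1], rel_b[m - 1]) for m in FLANKING)

print("inbred individual homozygous at a flanking marker:", homoz_ok)
print("relatives share a genotype at a flanking marker:", shared_ok)
```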

I will again show the lab by working through my own example.
To run lm_linkage you first need the input files you had before:

For later use, I have also included a reduced marker data file in which only the three affected individuals have marker data.

The parameter files: lm_link_1.par, lm_link_1r.par, and lm_link_2r.par

There will be three runs of the lm_linkage program, using these three parameter files. Consider first the file lm_link_1.par.

This file is almost exactly the same as the lm_auto_3.par of Lab-4. The only differences are that I have removed redundant statements about trait-1 and trait-2 (we will use only the full trait data, trait-3) and that the proband gamete statements have been replaced by statements that specify the locations for lod score computation:
map test tloc 33 all interval proportions 0.2 0.5 0.8
map test tloc 33 external recomb frac 0.05 0.2 0.4
This says that the lod scores will be computed at 3 points in each marker interval, at 20%, 50% and 80% of the interval, and also at 3 points at the end, at recombination fractions 0.05, 0.2 and 0.4 before the first marker and after the last.
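
To get a feel for where these test positions fall, here is a small sketch. It rests on assumptions that are not in this handout: a hypothetical map of 5 markers spaced 10 cM apart, the Haldane map function for turning recombination fractions into cM, and interval proportions taken as proportions of genetic distance. Check the maps printed in your own output rather than relying on these numbers.

```python
import math

def haldane_cm(theta):
    """Genetic distance in cM for recombination fraction theta (Haldane map function)."""
    return -50.0 * math.log(1.0 - 2.0 * theta)

def lod_positions(marker_cm, proportions, external_thetas):
    """Positions (cM) implied by the two 'map test' statements, left to right."""
    offsets = [haldane_cm(t) for t in external_thetas]
    before = sorted(marker_cm[0] - d for d in offsets)
    inside = [left + p * (right - left)
              for left, right in zip(marker_cm, marker_cm[1:])
              for p in proportions]
    after = sorted(marker_cm[-1] + d for d in offsets)
    return before + inside + after

# Hypothetical marker map: 5 markers, 10 cM apart, first marker at the origin.
MARKER_CM = [0.0, 10.0, 20.0, 30.0, 40.0]

positions = lod_positions(MARKER_CM, [0.2, 0.5, 0.8], [0.05, 0.2, 0.4])
for pos in positions:
    print(f"{pos:8.2f} cM")
```

With 5 markers this gives 3 positions in each of the 4 intervals plus 3 at each end, 18 in all.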

Running the lm_linkage program:

First run, with lm_link_1.par

Run lm_linkage, by typing % lm_linkage lm_link_1.par > link1.out
As always, the program generates quite a bit of output, so it is probably easier to look at it in a file.

Now look at the output link1.out. Much of MORGAN's output simply reports what it understood you to have asked for. This can be a bit tedious, but is well worth checking!! Most of the silly errors we have made in running MORGAN arose because something was a little bit wrong in a parameter file, and the program did exactly what the parameter file said.

The initial output is now as before, except that in addition to the marker map it prints additional maps showing the locations ("T") at which lod scores will be computed. All interval proportions and recombination fractions are converted to centiMorgans.

Then it summarizes all the other input just as in Lab-4, and finally it runs its MCMC and prints out a table of the estimated lod scores. The table gives the cM positions (setting the first marker as the origin, since we did not tell it otherwise), then the base-10 lod score (which is the key output), and finally an estimated Monte Carlo standard error (which you need not worry about, except to note that it should be quite small).

Note that it finds quite a good signal (although on this one pedigree we would not get a lod score of 3!). It does not pinpoint the trait locus, and this is characteristic -- with few meioses, there are few recombinants that can differentiate exactly which loci the DNA of the trait appears to be segregating with. But we do get a lod score close to 1.5 from marker-2 to beyond marker-4. The highest lod score (1.81) is at marker-4.
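
As a reminder of the scale, a base-10 lod score is the log10 of the ratio of the likelihood with the trait at the test position to the likelihood with the trait unlinked to the markers. The sketch below just converts lod scores back to likelihood ratios; the function names are my own and nothing here calls MORGAN.

```python
import math

def lod_score(lik_at_position, lik_unlinked):
    """Base-10 lod score from the two likelihoods."""
    return math.log10(lik_at_position / lik_unlinked)

def odds_from_lod(lod):
    """Likelihood ratio (odds in favor of linkage) implied by a lod score."""
    return 10.0 ** lod

for lod in (1.0, 1.5, 1.81, 3.0):
    print(f"lod {lod:4.2f} -> likelihood ratio of about {odds_from_lod(lod):7.1f} to 1")
```

So a lod score of 3 corresponds to a likelihood ratio of 1000 to 1, while the 1.81 seen here is roughly 65 to 1.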

Second run, with lm_link_1r.par

As it stands, this example is a bit unrealistic for this sort of problem -- we have marker data on all of the last three generations, which rather tightly constrains the inheritance. More realistically, in this type of study, we would have DNA, and hence marker data, only on the affected individuals (but could well still know that parents and other relatives are not affected). The parameter file lm_link_1r.par is exactly the same as lm_link_1.par except that we now use the reduced marker data jvped_2014.mark2.

I ran lm_linkage, by typing % lm_linkage lm_link_1r.par > link1r.out
and looked at the output link1r.out.
Note that the final lod scores are reduced -- the program still manages a lod score of 1 at markers 2 and 3, but barely.

Third run, with lm_link_2r.par

Note that the three affected individuals (fred, grandma, and 3v3) carry the more common marker alleles at marker-2 and marker-3 flanking the true (simulated) trait locus.

To examine the effect of this, I changed the marker allele frequencies in the parameter file lm_link_2r.par. The only change from lm_link_1r.par is in these marker allele frequencies, which are now:
set markers 1 4 5 allele freqs 0.4 0.3 0.2 0.1
set markers 2 allele freqs 0.4 0.05 0.25 0.3
set markers 3 allele freqs 0.05 0.3 0.25 0.4
That is, I have made the allele for which fred is homozygous (allele 2 at marker-2, allele 1 at marker-3) relatively rare, and adjusted the other frequencies so the sum is still 1.
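
When you edit allele frequencies by hand, it is easy to end up with a set that no longer sums to 1. The following sketch checks the three statements above. It is only a throwaway helper that understands this one statement form; it is not a general parser for MORGAN parameter files.

```python
import math

# The three statements quoted above, copied from lm_link_2r.par.
PAR_LINES = """\
set markers 1 4 5 allele freqs 0.4 0.3 0.2 0.1
set markers 2 allele freqs 0.4 0.05 0.25 0.3
set markers 3 allele freqs 0.05 0.3 0.25 0.4
"""

def parse_allele_freqs(text):
    """Map each marker number to its list of allele frequencies."""
    freqs = {}
    for line in text.splitlines():
        words = line.split()
        if words[:2] != ["set", "markers"] or "allele" not in words:
            continue  # not a statement of the form handled here
        split_at = words.index("allele")
        markers = [int(w) for w in words[2:split_at]]
        values = [float(w) for w in words[split_at + 2:]]  # skip "allele freqs"
        for m in markers:
            freqs[m] = values
    return freqs

freqs = parse_allele_freqs(PAR_LINES)
for marker in sorted(freqs):
    total = sum(freqs[marker])
    status = "ok" if math.isclose(total, 1.0) else "DOES NOT SUM TO 1"
    print(f"marker {marker}: {freqs[marker]} sum={total:.3f} {status}")
```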

I ran lm_linkage, by typing % lm_linkage lm_link_2r.par > link2r.out
and looked at the output link2r.out.
My lod scores in the region marker-2 to marker-3 are increased.
Why should I have expected this?

Note: This is important. Many of the false-positive linkage findings in the literature, especially in the case of marker data available only on affected individuals, have been due to assuming that the associated marker alleles are rare when they are not. This can happen when marker allele frequencies in the population of interest are not well studied, so that data-base frequencies from a different population are used instead.
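
One way to build intuition for this effect is a single-marker calculation for an inbred individual. The sketch below is a simplification and not what lm_linkage computes: it assumes Hardy-Weinberg founder genotypes and a hypothetical inbreeding coefficient F = 1/16 (the value for offspring of first cousins; your inbred individual's F may differ). It asks how likely it is that a homozygous genotype reflects two copies identical by descent rather than two copies that merely happen to match.

```python
def prob_ibd_given_homozygous(F, p):
    """P(two copies are IBD | homozygous for an allele of population frequency p).

    With probability F the two copies are IBD (homozygous with prob. p for this
    allele); otherwise they are independent draws (homozygous with prob. p*p).
    """
    ibd_and_homoz = F * p
    not_ibd_and_homoz = (1.0 - F) * p * p
    return ibd_and_homoz / (ibd_and_homoz + not_ibd_and_homoz)

F_HYPOTHETICAL = 1.0 / 16.0  # assumed inbreeding coefficient, for illustration only

for p in (0.4, 0.05):
    post = prob_ibd_given_homozygous(F_HYPOTHETICAL, p)
    print(f"allele frequency {p:4.2f}: P(IBD | homozygous) = {post:.3f}")
```

The rarer the allele is assumed to be, the more strongly homozygosity (or sharing between relatives) is taken as evidence of descent from a common ancestor, whether or not the assumed frequency is right.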

Assignment

Repeat the above analyses with your own pedigree and marker data.

Note: Please remember to tell me who is your inbred affected individual and who are your affected bilateral relatives, as well as any important information about them that influences the lod score.

Comment as specifically as you can on
(1) The lod scores you obtain in the first run,
(2) The reasons for any observed changes in the lod score when you only have marker data on your affected individuals, and
(3) The reasons for any observed changes in the lod score when you change the marker allele frequencies of the alleles carried by your affected individuals at the flanking markers marker-2 and marker-3.