Stat 550 (DL): Lab 5: Computing lod scores on pedigrees

In 2006, I switched this lab from Genehunter to Merlin. Both programs, among other things, calculate LOD scores on small pedigrees when there are data at multiple marker loci, but Merlin has better algorithms, is more flexible, and is better supported at this time.

I installed an updated version of Merlin on the biostat computers in 2006, and again more recently. Hopefully all the links and commands are up to date, but if anything is not right, please email me.

A disadvantage of Merlin is that it requires quite a few input files. For the examples for this lab I have put the data files in the subdirectory merlin_2006. I have also made a new description of these files, with links to the data files and to a description of the problem of genetic mapping of the Werner's Syndrome disease gene.

Part 1: The ApoB marker

For the first part of this lab, we will be exploring how misspecified allele frequencies can lead to spurious evidence for linkage. You should read the contents of the homoApoB.dat file. It explains the situation and summarizes the data we will be using.

In short, our data consist of pedigree and marker information on nineteen inbred Japanese families. When this data set was originally analyzed, the researchers used standard Caucasian allele frequencies in their model. Incorrect allele frequencies caused the analysis to indicate that the locus responsible for Werner's syndrome was located on chromosome 2 near the ApoB marker. Since that time, the true locus has been found on chromosome 8. In the first part of this lab, we will look at how different choices for marker allele frequencies change the outcome of a LOD score linkage analysis.

Copy the files from the data subdirectory merlin_2006 into your abacus account.
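As explained in the description of these files, the ones you need for this part are the ones named in the merlin commands below: apob3_merlin.dat, apob3_merlin.ped, apob3_merlin.mp, ws_merlin.model, and the three allele frequency files apob3_ca.freq, apob3_uw.freq, and apob3_jp.freq.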

Just for reference, here is a list of the (major) allele frequencies in the three groups:

Marker allele            1       2       3       4       5       6
Standard Caucasian    0.2190  0.0776  0.4040  0.0500  0.0756  0.0378
UW Caucasian sample   0.2000  0.0333  0.4667  0.0500  0.0833  0.0667
UW Japanese sample    0.6382  0.0263  0.1250  0.1184  0.0263  0.0132
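Because only the major alleles are listed, each row adds up to less than 1. A quick way to check the totals is sketched below. The helper name sum_freqs is made up for this illustration, and the sketch assumes a Bourne-style shell (sh or bash) with awk available.

```shell
# Hypothetical helper: add up the allele frequencies given as arguments
# and print the total to four decimal places.
sum_freqs() {
    printf '%s\n' "$@" | awk '{ total += $1 } END { printf "%.4f\n", total }'
}

# Values copied from the table above.
sum_freqs 0.2190 0.0776 0.4040 0.0500 0.0756 0.0378   # Standard Caucasian:  0.8640
sum_freqs 0.2000 0.0333 0.4667 0.0500 0.0833 0.0667   # UW Caucasian sample: 0.9000
sum_freqs 0.6382 0.0263 0.1250 0.1184 0.0263 0.0132   # UW Japanese sample:  0.9474
```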

To run merlin on these data, using the first of the three allele frequency files, type:

% merlin -d apob3_merlin.dat -p apob3_merlin.ped -m apob3_merlin.mp -f apob3_ca.freq --model ws_merlin.model --step 20

You probably want to cut-and-paste this lengthy command! Your browser has probably split the line, but make sure you enter it as a single command, all on one line. Also remember that we do not type the "%".
You can see here that we are mostly just specifying the data files: "-d" for the data specification, "-p" for the actual pedigree data, "-m" for the marker map, "-f" for the marker allele frequencies, and "--model" for the model file. The last option, "--step 20", specifies that lod scores will be computed at 20 points between the ApoB3 marker and the dummy marker I added.

You will get a warning message about allele frequencies not adding to 1, but you can ignore that. Caution: As I just learned the hard way, if you misname your allele frequency file, Merlin will just go ahead and estimate its own frequencies from the data. Then, of course, you get the same lod scores every time. So, do not totally ignore Merlin's messages!
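One way to guard against a mistyped frequency file name is to check that the file is readable before calling merlin. The sketch below wraps the command above in a shell function. The name run_apob is made up for this illustration, and the sketch assumes a Bourne-style shell (sh or bash); if your login shell is csh or tcsh, put it in a script file and run it with sh.

```shell
# Hypothetical wrapper (not part of Merlin): run the Part 1 analysis only
# if the named allele frequency file exists and is readable.
run_apob() {
    freq_file=$1
    if [ ! -r "$freq_file" ]; then
        echo "Cannot read allele frequency file: $freq_file" >&2
        return 1
    fi
    # Same options as the command shown above, with the -f file as a parameter.
    merlin -d apob3_merlin.dat -p apob3_merlin.ped -m apob3_merlin.mp \
           -f "$freq_file" --model ws_merlin.model --step 20
}
```

After defining the function, type run_apob apob3_ca.freq to do the first run. A misspelled file name then stops with an error message instead of silently producing lod scores from estimated frequencies.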

You should finally get a table of lod scores. There are 4 columns: the first two are the position and the lod score for that position. This lod score is summed over all the pedigrees. The final two columns relate to heterogeneity among pedigrees. If the ALPHA column is less than 1, it indicates that the pedigrees are giving strongly conflicting information, and the final HLOD column says what the LOD score would be if we actually model that only a fraction ALPHA of the pedigrees are linked and the rest are not.

Run merlin twice more, using the UW Caucasian frequencies apob3_uw.freq and then the Japanese frequencies apob3_jp.freq.
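To get a feel for how the ALPHA and HLOD columns relate to the per-family lod scores, here is a small sketch. It assumes the usual admixture definition of the heterogeneity lod score, in which HLOD at a position is the maximum over ALPHA between 0 and 1 of the sum over families of log10(ALPHA * 10^LOD_i + 1 - ALPHA); whether Merlin computes it in exactly this way should be checked against its documentation. The function name hlod and the grid search over ALPHA in steps of 0.001 are choices made for this illustration.

```shell
# Hypothetical helper: given per-family lod scores at one position,
# search a grid of ALPHA values and report the maximizing ALPHA and the HLOD.
hlod() {
    printf '%s\n' "$@" | awk '
        { lod[NR] = $1 }
        END {
            best = 0; best_alpha = 0          # ALPHA = 0 always gives HLOD = 0
            for (k = 0; k <= 1000; k++) {
                alpha = k / 1000
                total = 0
                for (i = 1; i <= NR; i++) {
                    # log10(alpha * 10^lod + 1 - alpha), using natural logs
                    total += log(alpha * exp(lod[i] * log(10)) + 1 - alpha) / log(10)
                }
                if (total > best) { best = total; best_alpha = alpha }
            }
            printf "ALPHA %.3f HLOD %.3f\n", best_alpha, best
        }'
}
```

For example, hlod 1.0 0.5 prints ALPHA 1.000 HLOD 1.500, because when every family supports linkage the HLOD is just the summed lod score. Try a mix of made-up positive and negative values to see ALPHA drop below 1.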

Here are some questions for you to answer (and turn in):

For questions 1 and 2, we did not actually output the separate lod scores by family (it is too much output), but you should see that close to the ApoB3 marker your ALPHA column indicates heterogeneity. To answer these two questions you may find it easiest to consult the data summary file.

1. Families 1, 3, 6, 7, 8, 9, 11, 15, and 17 produce negative LOD scores while families 2, 4, 5, 10, 12, 13, 14, and 16 produce positive LOD scores. Why?

2. Families 2 and 4 share the same pedigree structure, but family 2 produces LOD scores that are considerably lower than for family 4. Why?

3. From the total LOD scores for your 3 merlin runs, the two using the Caucasian allele frequencies should indicate some evidence for linkage near the ApoB marker (LOD scores about 2.0). Using the Japanese allele frequencies, there is no longer much support for linkage in the region. Why?

Part 2: Chromosome 8

In this part of this lab, we will reproduce the analysis that finally located the disease locus on chromosome 8. We have the same pedigrees as above, but now have data on the marker types at 13 markers along chromosome 8.

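As explained in the description of these files, the ones you need for this part are the ones named in the merlin command below: ws_merlin.dat, ws_merlin.ped, ws_merlin.mp, ws_merlin.model, and the two allele frequency files ws_ca.freq and ws_jp.freq.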

We will run merlin twice, once with the Japanese allele frequencies (file ws_jp.freq) and once with the Caucasian allele frequencies (file ws_ca.freq). 

To run merlin using ws_ca.freq type:

% merlin -d ws_merlin.dat -p ws_merlin.ped -m ws_merlin.mp -f ws_ca.freq --model ws_merlin.model --grid 1 --pdf

and similarly with ws_jp.freq.
As before, you probably want to cut-and-paste this lengthy command. Make sure you enter it as a single command, all on one line, and remember that we do not type the "%".

This time, instead of the --step option (you can use "--step 3", for example, if you prefer), I have used --grid 1 --pdf. This means merlin will compute lod scores at 1 cM intervals along the chromosome and will produce a PDF file with a plot of the lod scores. It calls this file merlin.pdf, so you probably want to rename it before you do the second run, or it will be overwritten. (Use % mv merlin.pdf merlin1.pdf, for example.)
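If you prefer, the two runs and the rename step can be combined, so that neither plot is overwritten. The sketch below is one way to do it. The function name run_chr8 and the output names merlin_ca.pdf and merlin_jp.pdf are made up for this illustration, and it assumes a Bourne-style shell (sh or bash) rather than csh or tcsh.

```shell
# Hypothetical wrapper: run the chromosome 8 analysis once per allele
# frequency file and keep each plot under its own name.
run_chr8() {
    for pop in ca jp; do
        merlin -d ws_merlin.dat -p ws_merlin.ped -m ws_merlin.mp \
               -f "ws_${pop}.freq" --model ws_merlin.model --grid 1 --pdf \
            || return 1
        # merlin always writes merlin.pdf, so rename it before the next run.
        mv merlin.pdf "merlin_${pop}.pdf" || return 1
    done
}
```

After defining the function, type run_chr8 in the directory that holds the data files. Still read the messages from each run, for the reason given in the caution in Part 1.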

Here are some questions for you to answer (and turn in):

4. Compare the lod score curves from the two analyses. You should see that the LOD scores are fairly similar. Does this mean that the Japanese and Caucasian allele frequencies are similar for the markers along this chromosome? Is it possible to have some of the allele frequencies wrong, but still end up with reasonably correct LOD scores? Comment.

5. We have seen that misspecified marker allele frequencies can make the LOD score artificially high. Can misspecified marker allele frequencies make the LOD score lower than it should be?