DOCUMENTATION FOR PROGRAM MCLEEPS VERSION 1.1
August 18, 2001
Monte Carlo Likelihood for Estimating Effective Population Size
Program authored by
Eric Anderson, University of Washington, Seattle.
Ellen Williamson, University of California, Berkeley.
Send questions and bug reports to
Eric Anderson (eric.anderson@stanfordalumni.org)
---------------------------------------------
CONTENTS
---------
-DIFFERENCES FROM VERSION 1.0
-PURPOSE
-COMPONENTS
*SOURCE CODE DISTRIBUTION
*MAC EXECUTABLE DISTRIBUTION
-UNIX INSTALLATION
-COMPILING ON OTHER PLATFORMS
-INPUT DATA FILE
-TESTING THE PROGRAM
-THE OUTPUT
-A NOTE ON THE CONFIDENCE INTERVALS
-RUNNING YOUR OWN DATA
-RUNNING MCLEEPS "IN THE BACKGROUND" ON THE MAC
-KNOWN INEFFICIENCIES
----------------------------------------------
DIFFERENCES FROM VERSION 1.0:
Version 1.1 represents only a small change from 1.0. Namely, a bug in function
RealizeNew() was fixed. This bug occurred with certain types of datasets. I thank
Loic Degen at the Laboratoire de Biologie de la Conservation in Lausanne for alerting
me to this bug.
PURPOSE:
MCLEEPS uses an importance sampling method to compute the likelihood for
the population genetics parameter Ne, given data on temporal changes of
allele frequencies in a population. The details of the algorithm
appear in:
Anderson,E.C., Williamson,E.G. and Thompson,E.A. (2000) Monte Carlo evaluation of the
likelihood for Ne from temporally-spaced samples. Genetics 156:2109-2118.
COMPONENTS:
SOURCE CODE DISTRIBUTION:
The source code distribution comes in an archive called "mcleeps_src.tar.gz".
After decompressing and extracting the archive you should have the following directories
and files:
/MCLEEPS_V_1.1_src
Readme_MCLEEPS_V_1.1.txt (this file)
/Sources
Makefile
ne_classes.cpp
ne_headers.cpp
com.c
linpack.c
ranlib.c
ranlib.h
testfile.txt
/RANLIB
(Things that come with the distribution of RANLIB)
/TestCompare
TC_SettingsAndDataOut.txt
TC_LogLikelihoodsOut.txt
TC_LocByLocLogLOut.txt
TC_LocusHistogramsOut.txt
MAC EXECUTABLE DISTRIBUTION:
The Mac executable distribution comes in a Stuffit Archive named "mcleeps.sit".
This will open with Stuffit Expander. After unstuffing you should have the following
folders and files:
/MCLEEPS_V_1.1_Mac
MCLEEPS_Mac_1.1
testfile.txt
/TestCompare
TC_SettingsAndDataOut.txt
TC_LogLikelihoodsOut.txt
TC_LocByLocLogLOut.txt
TC_LocusHistogramsOut.txt
The directory RANLIB contains the standard distribution for the RANLIB routines
which are in the public domain. Since I use some source code from RANLIB
(namely ranlib.c linpack.c and com.c) I included the whole distribution of RANLIB
in the source code distribution of MCLEEPS for completeness and documentation.
Nothing in the directory RANLIB is actually used, however, and it may be deleted if desired.
The directory /TestCompare contains output files for the test dataset. These are for
comparison to results obtained on your computer to make sure that the program is working
correctly on your machine.
UNIX INSTALLATION:
To compile the program into an executable named MCLEEPS_V_1.1, go to the
/Sources directory and type "make". This will execute the compiling and linking commands
in the Makefile. There should be a few warning messages from the ranlib code,
but none from ne_classes.cpp. These sources are compiled with g++. To use the
platform-specific compiler, modify the Makefile appropriately.
COMPILING ON OTHER PLATFORMS:
If you have a C++ compiler on another platform (Windows, for example) you should
be able to read the Makefile in the SOURCE CODE DISTRIBUTION and figure out how
to compile the source code into an executable on your platform.
INPUT DATA FILE:
The data you want analyzed must appear in a file named whatever you want to name it,
so long as it is all one word (No whitespace!!) and does not include punctuation other
than the underscore character and periods. For example, valid filenames would be,
"flies2000" or "salmon.1996" or "input_file.in". Invalid names would be, for example,
"last year's genes" or "temporal data". Notice: the quotation marks are not part of
the filenames.
You should put your data file in the same directory or folder as the executable
program MCLEEPS. The input file must be a white-space delimited
text file. The format of the data within it MUST correspond to the data format in
"testfile.txt". An explanation of that format is given below. Note that the
"<--"'s and the explanations following them cannot appear in the input file itself.
EXPLANATION OF DATA INPUT FORMAT FOR FILE "testfile.txt"
# Simulated data on 4 loci, with 5 alleles each, to test the program.
# You may include comments preceded by the '#' symbol, but only before you
# reach the actual data in the file
1000 <-- number of Monte Carlo replicates for each value of Ne
40 160 20 <-- (diploid) Ne values for which to compute likelihood: 40 to 160, in steps of 20
4 <-- number of loci
0 4 8 <-- generations when samples were taken. First is always 0.
6 20 72 100 2 <-- First Locus allele counts (of all alleles) at sample from generation 0
10 32 102 56 0 <-- First locus allele counts at next sample (generation 4)
3 14 93 90 0 <-- First locus allele counts at third (and last) sample (generation 8)
13 58 24 88 17 <-- Second locus allele counts at sample from generation 0.
9 68 35 76 12 <-- etc.
10 56 23 111 0
5 42 83 2 68
0 32 91 7 70
0 21 89 5 85
16 10 168 5 9
9 15 165 5 6
20 37 121 4 18
TESTING THE PROGRAM
Before you try running your own data through the program, you should run the
data in "testfile.txt" to make sure that the program is running properly on
your machine. To do so, follow these steps:
(1) Launch the program
(2) When prompted for the input file name, type "testfile.txt" (remember,
you don't type the quotation marks).
(3) The program should then tell you that you need two random number seeds.
Enter the values 1 and 2.
(4) You should then be prompted to inspect the file "aa_SettingAndDataOut.txt"
to make sure it resembles the input in "testfile.txt".
Do this. It should further familiarize you with what goes into the
data input file.
(5) Then, enter any character and let the program execute.
It takes about 2.5 minutes to complete execution on a 266 Mhz G3 laptop, so
it should be faster on most newer machines and Unix machines.
When execution has completed, you should compare the output files to those in
the directory /TestCompare. If the output isn't identical, (save for the differences
in program start and stop times), check to make sure
that the random seeds used were 1 and 2 (in that order). Please report any
differences to eric.anderson@stanfordalumni.org.
THE OUTPUT:
The program overwrites the output files each time it is run. This means that
after a run of the program you should MOVE THE OUTPUT FILES TO A DIFFERENT DIRECTORY
if you want to save them. Otherwise the results of the previous run will be lost if
you re-run the program. The output comes in four different files which are space-delimited
text files which may be easily imported into graphing or spreadsheet programs:
aa_SettingsAndDataOut.txt:
This file holds information about the settings and data for the run. It is
largely self-explanatory.
aa_LogLikelihoodsOut.txt
This file gives the Monte Carlo log-likelihood estimates and confidence intervals
for the different values of Ne. The first column is the diploid effective size;
the second column is the log-likelihood; the third column is the lower endpoint
of an approximate 95% confidence interval around the Monte Carlo estimate of the
log-likelihood; The fourth column is the upper endpoint of the 95% confidence
interval.
aa_LocByLocLogLOut.txt
This file is in a similar format, however there is a set of three columns
for each of the loci in the data. This way you can look at the log-likelihood
for Ne on a locus by locus basis. It is also useful for seeing which of
the loci have data for which an accurate estimate of the log-likelihood for
Ne is difficult---these will be the ones with wider condfidence intervals. Note
that the overall log-likelihood is the sum of the log-likelihood for each
locus, but the confidence interval endpoints on the overall likelihood are
not the sum of the locus-specific confidence interval endpoints.
aa_LocusHistogramsOut.txt
These are histograms for each locus and each value of Ne of the log of the
contribution to the Monte Carlo average. There is only output here for
number of reps greater than 1,000. Thus, there will be no numerical output
for the testfile. The information in this file can be used to assess whether
the importance sampling is unstable (as it might be with pathological data) or
not. See the section on importance sampling in Gelman, Carlin, Stern, and
Rubin (1996), "Bayesian Data Analysis," New York: Chapman and Hall for more
information about this.
NOTE: if you are running on the Mac, make sure that all files with the above
names in the directory from which you are launching MCLEEPS_Mac_1.1 are not
currently opened by any applications. If so, the program will be unable to
modify them. However, once you have started MCLEEPS_Mac_1.1, you may open
the files with other applications (e.g. a text editor) to monitor progress.
A NOTE ON THE CONFIDENCE INTERVALS:
The 95% confidence intervals are approximate, based on an asymptotic normal approximation.
They try to give confidence levels for the estimation of the log-likelihood at
particular values of Ne. Because of the correlated sampling (see below) the log-
likelihood differences between different values of Ne are typically better estimated,
than suggested by the confidence intervals on the log-likelihood values themselves.
We haven't yet built in a way to compute confidence intervals on the log-likelihood
differences.
Sometimes the lower confidence interval will be reported as "-Inf". This occurs when
the Monte Carlo variance is so high (either because the number of replicates is too low,
or because the importance sampling algorithm used is not working efficiently with a
particular dataset) that the lower limit of the confidence interval computed by the
asymptotic theory is less than zero for the likelihood, and hence less than -infinity
for the log-likelihood. Try increasing the number of replicates used. If that
doesn't help, then the importance sampling algorithm may not work well with your
dataset. See "KNOWN INEFFICIENCIES" below.
RUNNING YOUR OWN DATA:
Before running your own data, be sure to change the random number seeds in the file
RandomSeedsIn.txt (that should reside in the directory after you ran the program on
"testfile.txt") from "1 2" to any two positive integers of your choice. You should
use those two random numbers for any runs of the same dataset because the program
uses correlated sampling (a variance reduction technique) by using the same block of
random numbers for the Monte Carlo evaluation at each different value of Ne. This is
_not_ discussed in the Genetics article.
A good strategy for running your own data is to start with a low number of reps, say
1000 or so, and try only a few values (say 5) of Ne across the range of plausible values.
Then, the results from that run should allow you to narrow the range of Ne values you wish
to compute a log-likelihood for. You should try to compute values of the log-likelihood
for values of Ne near the peak of the log-likelihood curve and for other values of Ne
out to those for which the log-likelihood has dropped by 3 or 4 from the maximum.
Remember, including more values of Ne will increase the running time, as will increasing
the number of Monte Carlo replicates for each value of Ne.
A general rule of thumb for determining the number of Monte Carlo replicates is that
if you wish to shrink the confidence intervals on the log-likelihood estimates by a
factor of 1/n, then you should do n^2 times more replicates. This will not always hold,
but is a convenient rule of thumb.
RUNNING MCLEEPS "IN THE BACKGROUND" ON THE MAC:
With some data, very long runs may be required. This program runs well in the
background on the Mac. Thus, you may start the program and then work in other applications
while it is executing. Execution of MCLEEPS may be slowed if the other tasks you are
working on are computationally intensive. In this way, however, you needn't monopolize
your processor with MCLEEPS. If you wish to suspend execution of MCLEEPS on the Mac
temporarily, you may hit any key. The screen will alert you to the fact that the program
is suspended. Execution will resume when you enter any character.
KNOWN INEFFICIENCIES:
This importance sampling algorithm is not faultless. It has difficulty in some situations.
Most notably, alleles which do not appear in the first sample, but appear in large numbers
in future samples give it trouble, meaning that even if you use very many Monte Carlo reps, the
confidence intervals on the estimates may not shrink as desired. It also has difficulty
with alleles that are present in relatively high numbers in early samples, but which are then
lost from later samples. Addtionally, loci with many loci (<10) occasionally pose problems.
We are working to improve the importance sampling in these cases. If you find that MCLEEPS
isn't working well for your dataset, please notify Eric at eric.anderson@stanfordalumni.org
so he can address the problem in a future version of MCLEEPS.