DOCUMENTATION FOR PROGRAM MCLEEPS VERSION 1.1 August 18, 2001 Monte Carlo Likelihood for Estimating Effective Population Size Program authored by Eric Anderson, University of Washington, Seattle. Ellen Williamson, University of California, Berkeley. Send questions and bug reports to Eric Anderson (eric.anderson@stanfordalumni.org) --------------------------------------------- CONTENTS --------- -DIFFERENCES FROM VERSION 1.0 -PURPOSE -COMPONENTS *SOURCE CODE DISTRIBUTION *MAC EXECUTABLE DISTRIBUTION -UNIX INSTALLATION -COMPILING ON OTHER PLATFORMS -INPUT DATA FILE -TESTING THE PROGRAM -THE OUTPUT -A NOTE ON THE CONFIDENCE INTERVALS -RUNNING YOUR OWN DATA -RUNNING MCLEEPS "IN THE BACKGROUND" ON THE MAC -KNOWN INEFFICIENCIES ---------------------------------------------- DIFFERENCES FROM VERSION 1.0: Version 1.1 represents only a small change from 1.0. Namely, a bug in function RealizeNew() was fixed. This bug occurred with certain types of datasets. I thank Loic Degen at the Laboratoire de Biologie de la Conservation in Lausanne for alerting me to this bug. PURPOSE: MCLEEPS uses an importance sampling method to compute the likelihood for the population genetics parameter Ne, given data on temporal changes of allele frequencies in a population. The details of the algorithm appear in: Anderson,E.C., Williamson,E.G. and Thompson,E.A. (2000) Monte Carlo evaluation of the likelihood for Ne from temporally-spaced samples. Genetics 156:2109-2118. COMPONENTS: SOURCE CODE DISTRIBUTION: The source code distribution comes in an archive called "mcleeps_src.tar.gz". After decompressing and extracting the archive you should have the following directories and files: /MCLEEPS_V_1.1_src Readme_MCLEEPS_V_1.1.txt (this file) /Sources Makefile ne_classes.cpp ne_headers.cpp com.c linpack.c ranlib.c ranlib.h testfile.txt /RANLIB (Things that come with the distribution of RANLIB) /TestCompare TC_SettingsAndDataOut.txt TC_LogLikelihoodsOut.txt TC_LocByLocLogLOut.txt TC_LocusHistogramsOut.txt MAC EXECUTABLE DISTRIBUTION: The Mac executable distribution comes in a Stuffit Archive named "mcleeps.sit". This will open with Stuffit Expander. After unstuffing you should have the following folders and files: /MCLEEPS_V_1.1_Mac MCLEEPS_Mac_1.1 testfile.txt /TestCompare TC_SettingsAndDataOut.txt TC_LogLikelihoodsOut.txt TC_LocByLocLogLOut.txt TC_LocusHistogramsOut.txt The directory RANLIB contains the standard distribution for the RANLIB routines which are in the public domain. Since I use some source code from RANLIB (namely ranlib.c linpack.c and com.c) I included the whole distribution of RANLIB in the source code distribution of MCLEEPS for completeness and documentation. Nothing in the directory RANLIB is actually used, however, and it may be deleted if desired. The directory /TestCompare contains output files for the test dataset. These are for comparison to results obtained on your computer to make sure that the program is working correctly on your machine. UNIX INSTALLATION: To compile the program into an executable named MCLEEPS_V_1.1, go to the /Sources directory and type "make". This will execute the compiling and linking commands in the Makefile. There should be a few warning messages from the ranlib code, but none from ne_classes.cpp. These sources are compiled with g++. To use the platform-specific compiler, modify the Makefile appropriately. COMPILING ON OTHER PLATFORMS: If you have a C++ compiler on another platform (Windows, for example) you should be able to read the Makefile in the SOURCE CODE DISTRIBUTION and figure out how to compile the source code into an executable on your platform. INPUT DATA FILE: The data you want analyzed must appear in a file named whatever you want to name it, so long as it is all one word (No whitespace!!) and does not include punctuation other than the underscore character and periods. For example, valid filenames would be, "flies2000" or "salmon.1996" or "input_file.in". Invalid names would be, for example, "last year's genes" or "temporal data". Notice: the quotation marks are not part of the filenames. You should put your data file in the same directory or folder as the executable program MCLEEPS. The input file must be a white-space delimited text file. The format of the data within it MUST correspond to the data format in "testfile.txt". An explanation of that format is given below. Note that the "<--"'s and the explanations following them cannot appear in the input file itself. EXPLANATION OF DATA INPUT FORMAT FOR FILE "testfile.txt" # Simulated data on 4 loci, with 5 alleles each, to test the program. # You may include comments preceded by the '#' symbol, but only before you # reach the actual data in the file 1000 <-- number of Monte Carlo replicates for each value of Ne 40 160 20 <-- (diploid) Ne values for which to compute likelihood: 40 to 160, in steps of 20 4 <-- number of loci 0 4 8 <-- generations when samples were taken. First is always 0. 6 20 72 100 2 <-- First Locus allele counts (of all alleles) at sample from generation 0 10 32 102 56 0 <-- First locus allele counts at next sample (generation 4) 3 14 93 90 0 <-- First locus allele counts at third (and last) sample (generation 8) 13 58 24 88 17 <-- Second locus allele counts at sample from generation 0. 9 68 35 76 12 <-- etc. 10 56 23 111 0 5 42 83 2 68 0 32 91 7 70 0 21 89 5 85 16 10 168 5 9 9 15 165 5 6 20 37 121 4 18 TESTING THE PROGRAM Before you try running your own data through the program, you should run the data in "testfile.txt" to make sure that the program is running properly on your machine. To do so, follow these steps: (1) Launch the program (2) When prompted for the input file name, type "testfile.txt" (remember, you don't type the quotation marks). (3) The program should then tell you that you need two random number seeds. Enter the values 1 and 2. (4) You should then be prompted to inspect the file "aa_SettingAndDataOut.txt" to make sure it resembles the input in "testfile.txt". Do this. It should further familiarize you with what goes into the data input file. (5) Then, enter any character and let the program execute. It takes about 2.5 minutes to complete execution on a 266 Mhz G3 laptop, so it should be faster on most newer machines and Unix machines. When execution has completed, you should compare the output files to those in the directory /TestCompare. If the output isn't identical, (save for the differences in program start and stop times), check to make sure that the random seeds used were 1 and 2 (in that order). Please report any differences to eric.anderson@stanfordalumni.org. THE OUTPUT: The program overwrites the output files each time it is run. This means that after a run of the program you should MOVE THE OUTPUT FILES TO A DIFFERENT DIRECTORY if you want to save them. Otherwise the results of the previous run will be lost if you re-run the program. The output comes in four different files which are space-delimited text files which may be easily imported into graphing or spreadsheet programs: aa_SettingsAndDataOut.txt: This file holds information about the settings and data for the run. It is largely self-explanatory. aa_LogLikelihoodsOut.txt This file gives the Monte Carlo log-likelihood estimates and confidence intervals for the different values of Ne. The first column is the diploid effective size; the second column is the log-likelihood; the third column is the lower endpoint of an approximate 95% confidence interval around the Monte Carlo estimate of the log-likelihood; The fourth column is the upper endpoint of the 95% confidence interval. aa_LocByLocLogLOut.txt This file is in a similar format, however there is a set of three columns for each of the loci in the data. This way you can look at the log-likelihood for Ne on a locus by locus basis. It is also useful for seeing which of the loci have data for which an accurate estimate of the log-likelihood for Ne is difficult---these will be the ones with wider condfidence intervals. Note that the overall log-likelihood is the sum of the log-likelihood for each locus, but the confidence interval endpoints on the overall likelihood are not the sum of the locus-specific confidence interval endpoints. aa_LocusHistogramsOut.txt These are histograms for each locus and each value of Ne of the log of the contribution to the Monte Carlo average. There is only output here for number of reps greater than 1,000. Thus, there will be no numerical output for the testfile. The information in this file can be used to assess whether the importance sampling is unstable (as it might be with pathological data) or not. See the section on importance sampling in Gelman, Carlin, Stern, and Rubin (1996), "Bayesian Data Analysis," New York: Chapman and Hall for more information about this. NOTE: if you are running on the Mac, make sure that all files with the above names in the directory from which you are launching MCLEEPS_Mac_1.1 are not currently opened by any applications. If so, the program will be unable to modify them. However, once you have started MCLEEPS_Mac_1.1, you may open the files with other applications (e.g. a text editor) to monitor progress. A NOTE ON THE CONFIDENCE INTERVALS: The 95% confidence intervals are approximate, based on an asymptotic normal approximation. They try to give confidence levels for the estimation of the log-likelihood at particular values of Ne. Because of the correlated sampling (see below) the log- likelihood differences between different values of Ne are typically better estimated, than suggested by the confidence intervals on the log-likelihood values themselves. We haven't yet built in a way to compute confidence intervals on the log-likelihood differences. Sometimes the lower confidence interval will be reported as "-Inf". This occurs when the Monte Carlo variance is so high (either because the number of replicates is too low, or because the importance sampling algorithm used is not working efficiently with a particular dataset) that the lower limit of the confidence interval computed by the asymptotic theory is less than zero for the likelihood, and hence less than -infinity for the log-likelihood. Try increasing the number of replicates used. If that doesn't help, then the importance sampling algorithm may not work well with your dataset. See "KNOWN INEFFICIENCIES" below. RUNNING YOUR OWN DATA: Before running your own data, be sure to change the random number seeds in the file RandomSeedsIn.txt (that should reside in the directory after you ran the program on "testfile.txt") from "1 2" to any two positive integers of your choice. You should use those two random numbers for any runs of the same dataset because the program uses correlated sampling (a variance reduction technique) by using the same block of random numbers for the Monte Carlo evaluation at each different value of Ne. This is _not_ discussed in the Genetics article. A good strategy for running your own data is to start with a low number of reps, say 1000 or so, and try only a few values (say 5) of Ne across the range of plausible values. Then, the results from that run should allow you to narrow the range of Ne values you wish to compute a log-likelihood for. You should try to compute values of the log-likelihood for values of Ne near the peak of the log-likelihood curve and for other values of Ne out to those for which the log-likelihood has dropped by 3 or 4 from the maximum. Remember, including more values of Ne will increase the running time, as will increasing the number of Monte Carlo replicates for each value of Ne. A general rule of thumb for determining the number of Monte Carlo replicates is that if you wish to shrink the confidence intervals on the log-likelihood estimates by a factor of 1/n, then you should do n^2 times more replicates. This will not always hold, but is a convenient rule of thumb. RUNNING MCLEEPS "IN THE BACKGROUND" ON THE MAC: With some data, very long runs may be required. This program runs well in the background on the Mac. Thus, you may start the program and then work in other applications while it is executing. Execution of MCLEEPS may be slowed if the other tasks you are working on are computationally intensive. In this way, however, you needn't monopolize your processor with MCLEEPS. If you wish to suspend execution of MCLEEPS on the Mac temporarily, you may hit any key. The screen will alert you to the fact that the program is suspended. Execution will resume when you enter any character. KNOWN INEFFICIENCIES: This importance sampling algorithm is not faultless. It has difficulty in some situations. Most notably, alleles which do not appear in the first sample, but appear in large numbers in future samples give it trouble, meaning that even if you use very many Monte Carlo reps, the confidence intervals on the estimates may not shrink as desired. It also has difficulty with alleles that are present in relatively high numbers in early samples, but which are then lost from later samples. Addtionally, loci with many loci (<10) occasionally pose problems. We are working to improve the importance sampling in these cases. If you find that MCLEEPS isn't working well for your dataset, please notify Eric at eric.anderson@stanfordalumni.org so he can address the problem in a future version of MCLEEPS.