DOCUMENTATION FOR PROGRAM MCLEEPS VERSION 1.1
August 18, 2001

Monte Carlo Likelihood for Estimating Effective Population Size

Program authored by 
	Eric Anderson, University of Washington, Seattle.
	Ellen Williamson, University of California, Berkeley.
	
Send questions and bug reports to 
	Eric Anderson  (eric.anderson@stanfordalumni.org)
	

---------------------------------------------	
CONTENTS	
---------
	-DIFFERENCES FROM VERSION 1.0
	-PURPOSE
	-COMPONENTS
		*SOURCE CODE DISTRIBUTION
		*MAC EXECUTABLE DISTRIBUTION
	-UNIX INSTALLATION
	-COMPILING ON OTHER PLATFORMS
	-INPUT DATA FILE
	-TESTING THE PROGRAM
	-THE OUTPUT
	-A NOTE ON THE CONFIDENCE INTERVALS
	-RUNNING YOUR OWN DATA
	-RUNNING MCLEEPS "IN THE BACKGROUND" ON THE MAC
	-KNOWN INEFFICIENCIES
----------------------------------------------

DIFFERENCES FROM VERSION 1.0:
	Version 1.1 represents only a small change from 1.0.  Namely, a bug in function
	RealizeNew() was fixed.  This bug occurred with certain types of datasets.  I thank
	Loic Degen at the Laboratoire de Biologie de la Conservation in Lausanne for alerting
	me to this bug.

PURPOSE:
	MCLEEPS uses an importance sampling method to compute the likelihood for
	the population genetics parameter Ne, given data on temporal changes of
	allele frequencies in a population.  The details of the algorithm
	appear in:
	
	Anderson,E.C., Williamson,E.G. and Thompson,E.A. (2000) Monte Carlo evaluation of the
		likelihood for Ne from temporally-spaced samples. Genetics 156:2109-2118.

	
COMPONENTS:
	SOURCE CODE DISTRIBUTION:
		The source code distribution comes in an archive called "mcleeps_src.tar.gz".
		After decompressing and extracting the archive you should have the following directories
		and files:
			/MCLEEPS_V_1.1_src
				Readme_MCLEEPS_V_1.1.txt  (this file)
				
				/Sources
					Makefile
					ne_classes.cpp
					ne_headers.cpp
					com.c
					linpack.c
					ranlib.c
					ranlib.h
					testfile.txt
					
				/RANLIB
					(Things that come with the distribution of RANLIB)
					
				/TestCompare
					TC_SettingsAndDataOut.txt
					TC_LogLikelihoodsOut.txt
					TC_LocByLocLogLOut.txt
					TC_LocusHistogramsOut.txt
	
	MAC EXECUTABLE DISTRIBUTION:
		The Mac executable distribution comes in a Stuffit Archive named "mcleeps.sit".
		This will open with Stuffit Expander.  After unstuffing you should have the following
		folders and files:
			/MCLEEPS_V_1.1_Mac
				MCLEEPS_Mac_1.1
				testfile.txt
				
			/TestCompare
				TC_SettingsAndDataOut.txt
				TC_LogLikelihoodsOut.txt
				TC_LocByLocLogLOut.txt
				TC_LocusHistogramsOut.txt
		
		
	The directory RANLIB contains the standard distribution for the RANLIB routines
	which are in the public domain.  Since I use some source code from RANLIB
	(namely ranlib.c linpack.c and com.c) I included the whole distribution of RANLIB 
	in the source code distribution of MCLEEPS for completeness and documentation.  
	Nothing in the directory RANLIB is actually used, however, and it may be deleted if desired.  

	The directory /TestCompare contains output files for the test dataset.  These are for
	comparison to results obtained on your computer to make sure that the program is working
	correctly on your machine.
		
		
UNIX INSTALLATION:
	To compile the program into an executable named MCLEEPS_V_1.1, go to the
	/Sources directory and type "make".  This will execute the compiling and linking commands
	in the Makefile.  There should be a few warning messages from the ranlib code,
	but none from ne_classes.cpp.  These sources are compiled with g++.  To use the 
	platform-specific compiler, modify the Makefile appropriately.  
	
COMPILING ON OTHER PLATFORMS:
	If you have a C++ compiler on another platform (Windows, for example) you should
	be able to read the Makefile in the SOURCE CODE DISTRIBUTION and figure out how
	to compile the source code into an executable on your platform.  
	
	
INPUT DATA FILE:
	The data you want analyzed must appear in a file named whatever you want to name it,
	so long as it is all one word (No whitespace!!) and does not include punctuation other
	than the underscore character and periods.  For example, valid filenames would be,
	"flies2000" or "salmon.1996" or "input_file.in".  Invalid names would be, for example,
	"last year's genes" or "temporal data".  Notice: the quotation marks are not part of
	the filenames.  
	
	You should put your data file in the same directory or folder as the executable 
	program MCLEEPS.  The input file must be a white-space delimited
	text file.  The format of the data within it MUST correspond to the data format in
	"testfile.txt".  An explanation of that format is given below.  Note that the 
	"<--"'s and the explanations following them cannot appear in the input file itself.  
	
	EXPLANATION OF DATA INPUT FORMAT FOR FILE "testfile.txt" 
	
		# Simulated data on 4 loci, with 5 alleles each,  to test the program.  
		# You may include comments preceded by the '#' symbol, but only before you
		# reach the actual data in the file
		1000         <-- number of Monte Carlo replicates for each value of Ne
		40	160	 20  <-- (diploid) Ne values for which to compute likelihood: 40 to 160, in steps of 20
		4            <-- number of loci
		0	4	8    <-- generations when samples were taken.  First is always 0.

		6	20	72	100	2    <-- First Locus allele counts (of all alleles) at sample from generation 0
		10	32	102	56	0    <-- First locus allele counts at next sample (generation 4)
		3	14	93	90	0    <-- First locus allele counts at third (and last) sample (generation 8)

		13	58	24	88	17   <-- Second locus allele counts at sample from generation 0. 
		9	68	35	76	12   <-- etc.
		10	56	23	111	0

		5	42	83	2	68
		0	32	91	7	70
		0	21	89	5	85

		16	10	168	5	9
		9	15	165	5	6
		20	37	121	4	18

	
	
TESTING THE PROGRAM
	Before you try running your own data through the program, you should run the 
	data in "testfile.txt" to make sure that the program is running properly on
	your machine.  To do so, follow these steps:
		(1) Launch the program
		(2) When prompted for the input file name, type "testfile.txt" (remember,
			you don't type the quotation marks).
		(3) The program should then tell you that you need two random number seeds.
			Enter the values 1 and 2.
		(4) You should then be prompted to inspect the file "aa_SettingAndDataOut.txt"
			to make sure it resembles the input in "testfile.txt".
			Do this.  It should further familiarize you with what goes into the 
			data input file.
		(5) Then, enter any character and let the program execute.  
	It takes about 2.5 minutes to complete execution on a 266 Mhz G3 laptop, so
	it should be faster on most newer machines and Unix machines.  
	
	When execution has completed, you should compare the output files to those in
	the directory /TestCompare.  If the output isn't identical, (save for the differences
	in program start and stop times), check to make sure
	that the random seeds used were 1 and 2 (in that order).  Please report any 
	differences to eric.anderson@stanfordalumni.org.
	

THE OUTPUT:
	The program overwrites the output files each time it is run.  This means that
	after a run of the program you should MOVE THE OUTPUT FILES TO A DIFFERENT DIRECTORY
	if you want to save them.  Otherwise the results of the previous run will be lost if 
	you re-run the program.  The output comes in four different files which are space-delimited
	text files which may be easily imported into graphing or spreadsheet programs:
		
		aa_SettingsAndDataOut.txt:
			This file holds information about the settings and data for the run.  It is 
			largely self-explanatory. 
		
		aa_LogLikelihoodsOut.txt
			This file gives the Monte Carlo log-likelihood estimates and confidence intervals
			for the different values of Ne.  The first column is the diploid effective size;
			the second column is the log-likelihood; the third column is the lower endpoint
			of an approximate 95% confidence interval around the Monte Carlo estimate of the
			log-likelihood;  The fourth column is the upper endpoint of the 95% confidence
			interval.  
			
		aa_LocByLocLogLOut.txt
			This file is in a similar format, however there is a set of three columns
			for each of the loci in the data.  This way you can look at the log-likelihood
			for Ne on a locus by locus basis.  It is also useful for seeing which of 
			the loci have data for which an accurate estimate of the log-likelihood for 
			Ne is difficult---these will be the ones with wider condfidence intervals. Note
			that the overall log-likelihood is the sum of the log-likelihood for each
			locus, but the confidence interval endpoints on the overall likelihood are
			not the sum of the locus-specific confidence interval endpoints.  
			
		aa_LocusHistogramsOut.txt 
			These are histograms for each locus and each value of Ne of the log of the 
			contribution to the Monte Carlo average.  There is only output here for 
			number of reps greater than 1,000.  Thus, there will be no numerical output
			for the testfile.  The information in this file can be used to assess whether 
			the importance sampling is unstable (as it might be with pathological data) or
			not.  See the section on importance sampling in Gelman, Carlin, Stern, and 
			Rubin (1996), "Bayesian Data Analysis," New York: Chapman and Hall for more 
			information about this.  
	
	NOTE: if you are running on the Mac, make sure that all files with the above
	names in the directory from which you are launching MCLEEPS_Mac_1.1 are not
	currently opened by any applications.  If so, the program will be unable to
	modify them.  However, once you have started  MCLEEPS_Mac_1.1, you may open 
	the files with other applications (e.g. a text editor) to monitor progress.  
			
A NOTE ON THE CONFIDENCE INTERVALS:
	The 95% confidence intervals are approximate, based on an asymptotic normal approximation.
	They try to give confidence levels for the estimation of the log-likelihood at
	particular values of Ne.  Because of the correlated sampling (see below) the log-
	likelihood differences between different values of Ne are typically better estimated,
	than suggested by the confidence intervals on the log-likelihood values themselves.
	We haven't yet built in a way to compute confidence intervals on the log-likelihood
	differences.  
	
	Sometimes the lower confidence interval will be reported as "-Inf".  This occurs when 
	the Monte Carlo variance is so high (either because the number of replicates is too low,
	or because the importance sampling algorithm used is not working efficiently with a
	particular dataset) that the lower limit of the confidence interval computed by the 
	asymptotic theory is less than zero for the likelihood, and hence less than -infinity
	for the log-likelihood.  Try increasing the number of replicates used.  If that 
	doesn't help, then the importance sampling algorithm may not work well with your
	dataset. See "KNOWN INEFFICIENCIES" below.
			
RUNNING YOUR OWN DATA:
	Before running your own data, be sure to change the random number seeds in the file
	RandomSeedsIn.txt (that should reside in the directory after you ran the program on
	"testfile.txt") from "1  2"  to any two positive integers of your choice.  You should
	use those two random numbers for any runs of the same dataset because the program
	uses correlated sampling (a variance reduction technique) by using the same block of
	random numbers for the Monte Carlo evaluation at each different value of Ne.  This is
	_not_ discussed in the Genetics article.  
	
	A good strategy for running your own data is to start with a low number of reps, say
	1000 or so, and try only a few values (say 5) of Ne across the range of plausible values.
	Then, the results from that run should allow you to narrow the range of Ne values you wish
	to compute a log-likelihood for.   You should try to compute values of the log-likelihood
	for values of Ne near the peak of the log-likelihood curve and for other values of Ne 
	out to those for which the log-likelihood has dropped by 3 or 4 from the maximum.
	
	Remember, including more values of Ne will increase the running time, as will increasing
	the number of Monte Carlo replicates for each value of Ne.  
	
	A general rule of thumb for determining the number of Monte Carlo replicates is that
	if you wish to shrink the confidence intervals on the log-likelihood estimates by a 
	factor of  1/n, then you should do n^2 times more replicates.  This will not always hold,
	but is a convenient rule of thumb.
	
RUNNING MCLEEPS "IN THE BACKGROUND" ON THE MAC:
	With some data, very long runs may be required.  This program runs well in the
	background on the Mac.  Thus, you may start the program and then work in other applications
	while it is executing.  Execution of MCLEEPS may be slowed if the other tasks you are 
	working on are computationally intensive.  In this way, however, you needn't monopolize
	your processor with MCLEEPS.  If you wish to suspend execution of MCLEEPS on the Mac 
	temporarily, you may hit any key.  The screen will alert you to the fact that the program
	is suspended.  Execution will resume when you enter any character.  

KNOWN INEFFICIENCIES:
	This importance sampling algorithm is not faultless.  It has difficulty in some situations.  
	Most notably, alleles which do not appear in the first sample, but appear in large numbers
	in future samples give it trouble, meaning that even if you use very many Monte Carlo reps, the
	confidence intervals on the estimates may not shrink as desired.  It also has difficulty
	with alleles that are present in relatively high numbers in early samples, but which are then
	lost from later samples.  Addtionally, loci with many loci (<10) occasionally pose problems.
	We are working to improve the importance sampling in these cases. If you find that MCLEEPS 
	isn't working well for your dataset, please notify Eric at eric.anderson@stanfordalumni.org 
	so he can address the problem in a future version of MCLEEPS.