Structure Lab

The purpose of this lab is to become familiar with the structure program for inferring population subdivision, in particular for determining the number of genetically differentiated populations (K). The analysis should be run with different values of K. Ideally, for each value of K you would run the analysis multiple times to confirm that the runs produce similar results, perhaps varying the burnin and run length. For these data I have already identified a reasonable burnin and run length (larger values don't change the results). By default, every run starts with a different random number seed, so even identical parameters will produce slightly different outputs, and occasionally startlingly different ones.

0. Download and install Structure 2.3.4 (without graphical front end) from http://pritchardlab.stanford.edu/structure.html

From the documentation for the program, read Section 5, particularly 5.1.

1. Run structure with K = 2, 3, and 4 on the data in Ros_Amer_26x223.txt. The data are from the Rosenberg lab paper Wang et al. 2007, "Genetic Variation and Population Structure in Native Americans". I removed many individuals, and some loci with a lot of missing data, so that the analysis runs within a reasonable class time. Even with this reduced data set, each run takes quite a while (5-10 minutes), so you may want to share results, or you may have to continue runs after class to get all your results. In case of complete chaos, I have already completed multiple runs at each K and can pack up the result sets and provide them to you.

Each pair of lines in the data file has the genotype information for an individual for each locus (one allele on each of the two lines), plus some identifying data. Each allele at each locus is coded as a number (there are many alleles, as is typical for microsatellites). A -9 value indicates missing data.

Each line in the file has the format:

individual <tab> popTag <tab> populationName <tab> locationCountry <tab> locationGlobal <tab> locus 1 allele <tab> locus 2 allele ... <tab> locus 223 allele

The "population name" is the group (ethnic group, people, tribe, whatever you want to call it) in which the sampled individual resided. Not all such groups are expected to be genetically well-differentiated from all other groups. popTag is just a number corresponding to each group.
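To make the two-lines-per-individual layout concrete, here is a minimal Python sketch that reads lines in the format above and pairs up the two alleles at each locus. It assumes the file really is tab-delimited with five identifying columns before the alleles, as described. The function names are made up for illustration and are not part of structure.

```python
MISSING = -9   # missing-data code described above
N_META = 5     # individual, popTag, populationName, locationCountry, locationGlobal


def parse_genotypes(lines):
    """Group the two lines per individual into a list of (allele1, allele2) pairs."""
    rows = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        alleles = [int(a) for a in fields[N_META:]]
        entry = rows.setdefault(
            fields[0],
            {"pop_tag": fields[1], "pop_name": fields[2], "rows": []},
        )
        entry["rows"].append(alleles)

    result = {}
    for ind, entry in rows.items():
        if len(entry["rows"]) != 2:
            raise ValueError(f"individual {ind} does not have exactly two lines")
        first, second = entry["rows"]
        result[ind] = {
            "pop_tag": entry["pop_tag"],
            "pop_name": entry["pop_name"],
            "genotypes": list(zip(first, second)),  # one pair per locus
        }
    return result


def missing_fraction(genotypes):
    """Fraction of allele calls coded as missing (-9) for one individual."""
    alleles = [a for pair in genotypes for a in pair]
    return sum(a == MISSING for a in alleles) / len(alleles)
```

With the real file you would call something like parse_genotypes(open("Ros_Amer_26x223.txt")) and expect 223 pairs per individual.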

You will need to edit the parameter file "mainparams": some of the following values may need to be changed (you can set input and output files and K at the command line):

NUMINDS    26
NUMLOCI   223
LABEL        1
POPDATA    1
POPFLAG    0
EXTRACOLS    3
MARKERNAMES    0

There is no need to set MAXPOPS (the same as K) because we will set it from the command line and that overrides the mainparams setting.

Based on preliminary runs by me on these data, set:

BURNIN 20000
NUMREPS 100000

Strangely there seems to be no way to set thinning interval (c); I assume it is always 1.
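If you would rather not edit mainparams by hand, the settings above can be applied with a small Python sketch like the one below. It assumes each parameter sits on a line of the form "#define NAME value", optionally followed by a comment, which is how the mainparams file shipped with structure is laid out. Check your copy before relying on this. The helper name set_params is made up for illustration.

```python
import re

# Matches "#define NAME value <rest of line>", keeping the original spacing.
DEFINE_RE = re.compile(r"^(\s*#define\s+)(\w+)(\s+)(\S+)(.*)$")


def set_params(text, updates):
    """Return mainparams text with the given '#define NAME value' lines updated."""
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        m = DEFINE_RE.match(line)
        if m and m.group(2) in remaining:
            value = remaining.pop(m.group(2))
            # Replace only the value; keep prefix, spacing, and trailing comment.
            line = f"{m.group(1)}{m.group(2)}{m.group(3)}{value}{m.group(5)}"
        out.append(line)
    if remaining:
        raise KeyError(f"parameters not found in file: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# The values listed in this lab.
LAB_SETTINGS = {
    "NUMINDS": 26, "NUMLOCI": 223, "LABEL": 1, "POPDATA": 1, "POPFLAG": 0,
    "EXTRACOLS": 3, "MARKERNAMES": 0, "BURNIN": 20000, "NUMREPS": 100000,
}
```

Usage would be along the lines of reading mainparams into a string, calling set_params(text, LAB_SETTINGS), and writing the result back. A KeyError means one of the names was not found, so nothing gets silently skipped.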

A typical run from the command line will be something like this: structure -K 2 -i Ros_Amer_26x223.txt -o outK2

Make sure to use a different name for every output file that you want to keep; otherwise it will be overwritten. "_f" seems to always be appended to the output file name no matter what you do, so the actual output file above would be "outK2_f".
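Since each K should be run more than once and every run needs its own output name, the command lines can be generated rather than typed. The following Python sketch builds one command per (K, replicate) pair using only the -K, -i, and -o options shown above. It assumes the structure executable is on your PATH and that mainparams (and extraparams) are in the working directory. The output naming scheme is just one possibility.

```python
import subprocess


def build_commands(ks, reps, infile, prefix="out"):
    """Return one structure command (as an argument list) per K and replicate."""
    commands = []
    for k in ks:
        for rep in range(1, reps + 1):
            outfile = f"{prefix}K{k}_rep{rep}"  # unique name, so nothing is overwritten
            commands.append(
                ["structure", "-K", str(k), "-i", infile, "-o", outfile]
            )
    return commands


def run_all(commands):
    """Run the commands one after another; stop if any run fails."""
    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)


commands = build_commands([2, 3, 4], reps=2, infile="Ros_Amer_26x223.txt")
```

Calling run_all(commands) would then launch the six runs in sequence. At 5-10 minutes each, that is roughly an hour, so start with fewer replicates in class.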

The output file has a bunch of stuff we won't use, but the parts we will use are pretty obvious and simple (section headed "Proportion of membership of ...", the line "Estimated Ln Prob of Data", and the section headed "Inferred ancestry of individuals").
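As a rough illustration of how the "Estimated Ln Prob of Data" line can be used, the Python sketch below pulls that number out of an output file and converts a set of values (one per K) into approximate posterior probabilities of K, assuming a uniform prior over the K values tried. This is my reading of the ad hoc procedure in Section 5.1 of the documentation, so treat it as a guide rather than a formal test. The regular expression assumes the line looks like "Estimated Ln Prob of Data = <number>". The function names are hypothetical.

```python
import math
import re

LNPROB_RE = re.compile(r"Estimated Ln Prob of Data\s*=\s*(-?\d+(?:\.\d+)?)")


def read_ln_prob(text):
    """Extract the estimated ln P(data | K) from the text of an output file."""
    match = LNPROB_RE.search(text)
    if match is None:
        raise ValueError("no 'Estimated Ln Prob of Data' line found")
    return float(match.group(1))


def posterior_over_k(ln_prob_by_k):
    """Approximate Pr(K | data) from ln P(data | K), assuming a uniform prior on K.

    Subtracting the maximum before exponentiating avoids numerical underflow,
    since the raw values are large negative numbers.
    """
    best = max(ln_prob_by_k.values())
    weights = {k: math.exp(v - best) for k, v in ln_prob_by_k.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```

For example, after reading outK2_f, outK3_f, and outK4_f you would pass a dictionary such as {2: lnP2, 3: lnP3, 4: lnP4} to posterior_over_k. Because replicate runs at the same K can differ, it is worth checking that the ranking of K values is stable across runs before trusting it.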

I have already made multiple runs for each K and will display some summary statistics in class. If you get into this (just for fun) and you have Python installed on your computer, I can send you a few simple Python scripts that automate runs and summarize the output data (for the summaries you will need to have numpy and scipy installed).

1. Which value of K provides the best fit to the data? (see documentation section 5.3). How many genetically-differentiated populations (not ethnic groups) do you conclude are in the sample? For the best value of K, are any ethnic groups genetically similar to each other? Are any ethnic groups split into separate genetic populations? (for this last question it is interesting to look at K=4 as well).

2. Optional - if you want to make pretty figures, you can use the distruct program (possibly after sorting the data using CLUMPP); both are available from Noah Rosenberg's software page.