Structure Lab
The purpose of this lab is to become familiar with the structure
program for inferring population subdivision, in particular for
determining the number of genetically-differentiated populations
(K). The analysis should be run with different values of K.
Ideally, for each value of K you would run the analysis multiple
times to check that the replicate runs produce similar results,
perhaps varying burnin and run length. I have already identified a
reasonable burnin and run length for these data (larger values
don't change the results). Every run (by default) starts with a
different random number seed, so even runs with the same parameters
will produce slightly different outputs, and occasionally
startlingly different outputs.
From the documentation for the structure program, read Section 5,
particularly 5.1.
1. Run structure with K = 2, 3, and 4 on the data in Ros_Amer_26x223.txt.
The data are from the Rosenberg lab paper Wang et al. 2007, "Genetic Variation
and Population Structure in Native Americans". I removed a lot of
individuals and some loci with a lot of missing data to make it
run in a reasonable class time. Even with this reduced data set,
the runs take quite a while (5-10 minutes), so you may want to
share results or you may have to continue runs after class to get
all your results. In case of complete chaos, I have already
completed multiple runs at each K and I can pack up the result
sets and provide them to you.
Each pair of lines in the data file has the genotype information
for an individual for each locus (one allele on each of the two
lines), plus some identifying data. Each allele at each locus is
coded as a number (there are many alleles, as is typical for
microsatellites). A -9 value indicates missing data.
Each line of the file has the format:
individual <tab> popTag <tab> populationName
<tab> locationCountry <tab> locationGlobal <tab>
locus 1 allele <tab> locus 2 allele ... <tab>
locus 223 allele
The "population name" is the group (ethnic group, people, tribe,
whatever you want to call it) in which the sampled individual
resided. Not all such groups are expected to be genetically
well-differentiated from all other groups. popTag is just a number
corresponding to each group.
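If you want to inspect the data file outside of structure, the layout described above can be read with a few lines of Python. This is only a sketch: it assumes the file is tab-delimited exactly as described, has no header line, and that the two lines for an individual are adjacent. The function names and the toy data in the demo are made up for illustration and are not part of the lab files.

```python
import io

MISSING = -9   # missing-data code described above
N_META = 5     # individual, popTag, populationName, locationCountry, locationGlobal


def read_genotypes(handle):
    """Return {individual: {"pop": populationName, "alleles": [(a1, a2), ...]}}.

    Assumes two consecutive tab-delimited lines per individual.
    """
    rows = [line.rstrip("\n").split("\t") for line in handle if line.strip()]
    individuals = {}
    for first, second in zip(rows[0::2], rows[1::2]):
        if first[0] != second[0]:
            raise ValueError(f"lines are not paired: {first[0]} vs {second[0]}")
        a1 = [int(x) for x in first[N_META:]]
        a2 = [int(x) for x in second[N_META:]]
        # One (allele, allele) tuple per locus
        individuals[first[0]] = {"pop": first[2], "alleles": list(zip(a1, a2))}
    return individuals


def missing_fraction(record):
    """Fraction of allele calls coded as missing for one individual."""
    alleles = [a for pair in record["alleles"] for a in pair]
    return sum(a == MISSING for a in alleles) / len(alleles)


if __name__ == "__main__":
    # Toy two-line record with 3 loci (invented values, not from the real file)
    toy = ("ind1\t1\tPopA\tCountryX\tAMERICA\t120\t-9\t200\n"
           "ind1\t1\tPopA\tCountryX\tAMERICA\t122\t-9\t204\n")
    data = read_genotypes(io.StringIO(toy))
    print(data["ind1"]["alleles"], missing_fraction(data["ind1"]))
```

For the real file you would pass open("Ros_Amer_26x223.txt") instead of the toy string; the number of individuals and loci it reports is a quick check on NUMINDS and NUMLOCI.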
You will need to edit the parameter file "mainparams": some of
the following values may need to be changed (you can set input and
output files and K at the command line):
NUMINDS 26
NUMLOCI 223
LABEL 1
POPDATA 1
POPFLAG 0
EXTRACOLS 3
MARKERNAMES 0
There is no need to set MAXPOPS (the same as K) because we will
set it from the command line and that overrides the mainparams
setting.
Based on preliminary runs by me on these data, set:
BURNIN 20000
NUMREPS 100000
Strangely there seems to be no way to set thinning interval (c);
I assume it is always 1.
A typical run from the command line will be something like
this:
structure -K 2 -i Ros_Amer_26x223.txt -o outK2
Make sure to use different names for every output file that you
want to keep, else they will be over-written. "_f" seems to always
be added to the output file name no matter what you do, so the
actual output file above would be "outK2_f".
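If you would rather not type each command by hand, the runs for K = 2, 3, and 4 can be scripted. The sketch below only uses the -K, -i, and -o options shown above and assumes that the structure executable is on your PATH and that mainparams is in the current directory. The function names and the "outK2_rep1" naming scheme are my own invention for illustration; any scheme that gives every run a distinct name will do.

```python
import subprocess


def structure_command(k, infile="Ros_Amer_26x223.txt", rep=1, exe="structure"):
    """Build the argument list for one run, with a unique output name."""
    outfile = f"outK{k}_rep{rep}"   # structure writes this as outK{k}_rep{rep}_f
    return [exe, "-K", str(k), "-i", infile, "-o", outfile]


def run_all(ks=(2, 3, 4), reps=1):
    """Run structure once per (K, replicate); stops on the first failure."""
    for k in ks:
        for rep in range(1, reps + 1):
            cmd = structure_command(k, rep=rep)
            print("running:", " ".join(cmd))
            subprocess.run(cmd, check=True)
```

Calling run_all(reps=3) would launch nine runs one after another, so with runs of 5-10 minutes each it is best left going after class.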
The output file has a bunch of stuff we won't use, but the parts
we will use are pretty obvious and simple (section headed
"Proportion of membership of ...", the line "Estimated Ln Prob of
Data", and the section headed "Inferred ancestry of individuals").
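The "Estimated Ln Prob of Data" value can be pulled out of an output file automatically. This is a minimal sketch that assumes the line looks like "Estimated Ln Prob of Data   = -12345.6" (label, equals sign, number); check one of your own output files and adjust the pattern if it differs. The function names are hypothetical.

```python
import re

# Assumed layout of the line: label, optional spaces, "=", then a number
LN_PROB = re.compile(r"Estimated Ln Prob of Data\s*=\s*(-?\d+(?:\.\d+)?)")


def ln_prob_of_data(text):
    """Return the estimated ln Pr(data | K) found in the output text."""
    match = LN_PROB.search(text)
    if match is None:
        raise ValueError("no 'Estimated Ln Prob of Data' line found")
    return float(match.group(1))


def ln_prob_from_file(path):
    """Same as above, but reading the text from an output file such as outK2_f."""
    with open(path) as handle:
        return ln_prob_of_data(handle.read())
```

For example, ln_prob_from_file("outK2_f") would return the value for the K=2 run from the command shown earlier.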
I have already made multiple runs for each K and will display
some summary statistics in class. If you get into this (just for
fun) and you have python installed on your computer I can send you
a few simple python scripts that automate runs and summarize
output data (for summary you will need to have numpy and scipy
installed).
1. Which value of K provides the best fit to the data? (see
documentation section 5.3). How many genetically-differentiated
populations (not ethnic groups) do you conclude are in the sample?
For the best value of K, are any ethnic groups genetically similar
to each other? Are any ethnic groups split into separate genetic
populations? (for this last question it is interesting to look at
K=4 as well).
2. Optional - if you want to make pretty figures, you can use the
distruct program (possibly after sorting the data using CLUMPP),
both from Noah Rosenberg's software page.
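For question 1, one ad hoc way to compare values of K is to treat the "Estimated Ln Prob of Data" from each run as ln Pr(X|K) and, assuming every K is equally likely beforehand, convert the values to approximate posterior probabilities. Check documentation section 5.3 for the exact procedure and its caveats; the sketch below is my reading of that idea, not a replacement for it. The function name is made up, and the numbers in the usage note are placeholders rather than results from these data.

```python
import math


def posterior_over_k(ln_prob_by_k):
    """Approximate Pr(K | X) from {K: ln Pr(X | K)}, assuming a uniform prior on K.

    The largest value is subtracted before exponentiating so that very
    negative log probabilities do not underflow to zero.
    """
    top = max(ln_prob_by_k.values())
    weights = {k: math.exp(v - top) for k, v in ln_prob_by_k.items()}
    total = sum(weights.values())
    return {k: w / total for k, w in weights.items()}
```

With placeholder inputs such as posterior_over_k({2: -1000.0, 3: -990.0}), nearly all of the probability goes to K=3, because a difference of 10 log units is a factor of about 22000. Differences between replicate runs at the same K can be just as large, which is one reason to run each K more than once before drawing a conclusion.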