| [Top] | [Contents] | [Index] | [ ? ] |
This is a tutorial for the MORGAN (Monte Carlo Genetic Analysis) package, focusing on how to use the provided programs.
The current version of this document is written for MORGAN Version 2.8.1.
| 1.1 Overview of MORGAN | ||
| 1.2 Get the Tutorial | ||
| 1.3 Get and set up the examples | ||
| 1.4 Overview of the pedigrees used in the examples | ||
| 1.5 Future directions of MORGAN |
MORGAN (Monte Carlo Genetic Analysis) is a collection of programs and libraries developed at the University of Washington under the PANGAEA (Pedigree Analysis for Genetics and Epidemiological Attributes) umbrella. This software implements a number of methods for the analysis of data observed on members of a pedigree, with the main programs implementing Markov Chain Monte Carlo (MCMC) methods. As of the date of this tutorial, the latest MORGAN version is 2.8.1 which was released in April 2006. It is available for download through the MORGAN V2.6 home page at the Department of Statistics, University of Washington.
The MORGAN programs are grouped into four categories:
pedcheck checks for errors in pedigree structure and data format, see Checking Pedigree Validity. kin computes kinship and inbreeding coefficients for members of the pedigree, see Computing Kinship and One- or Two-Locus Inbreeding Coefficients.
genedrop simulates data on a pedigree for analysis by other programs, see Simulating Marker and Trait Data in Pedigrees. markerdrop simulates marker data at loci linked to a potential trait locus, see Simulating Marker Data Conditional on Trait Data in Pedigrees. ibddrop uses Monte Carlo to estimate gene ibd (identity by descent) probabilities in the absence of data, see Estimating a priori ibd Probabilities by Monte Carlo.
lm_auto and
lm_pval, estimate conditional gene ibd probabilities, see
Estimating Conditional ibd Probabilities by MCMC. The Lodscore
programs, lm_lods, lm_markers, lm_bayes and
lm_schnell estimate multilocus LOD scores, see Estimating Location LOD Scores by MCMC. A brief introduction to the MCMC
techniques employed by MORGAN can be found in Using MCMC to Estimate Parameters of Interest in Pedigree Data.
univar, unibig, bivar and multivar, see Polygenic Modeling of Quantitative Traits by EM Algorithm.
This tutorial is based on the computing notes of Dr. Elizabeth Thompson. Descriptions of the parameter statements are taken from the `README_userdoc' files written by Myrna Jewett. The original version of this document was written in 2002 by Michael Na Li, and revised by Myrna Jewett and Adele Mitchell in 2006. The current version is revised by Audrey Q. Fu.
Combined with hands-on examples, this tutorial gives a brief introduction to the usage of the main MORGAN programs. For further information, please refer to the MORGAN documentation and to the references cited.
This tutorial is available on-line at
http://www.stat.washington.edu/thompson/Genepi/MORGAN/Morgan.shtml#tut
Several formats of this tutorial are also available to download for off-line reading or printing.
This tutorial assumes that someone else has already installed the MORGAN software somewhere. If not, please contact your local system administrator or download the software yourself and follow the instructions therein.
Follow the following steps to download and set up the examples:
user$ tar zxvf morgan-examples.tar.gz |
Or if the above command fails (you don't have GNU tar), use
user$ gunzip -c morgan-examples.tar.gz | tar xvf - |
This will produce a `MORGAN_Examples' directory under your current directory.
(Note: Throughout the text file and directory names are enclosed in single quotes; these single quotes are not part of the file or directory name.)
Before making links, you first need to edit the `Makefile' (using your favorite text
editor, for instance vi or pico) in the
`morgan-examples' directory to make sure the paths to your
MORGAN programs and those to the `MORGAN_Examples'
directory are correct. Most often, it is necessary
to change the `MORGANDIR' and `EXAMPLEDIR' statements to reflect
the locations of the MORGAN files on your system and the examples,
respectively. Here is the relevant part of the `Makefile',
# Change the following macros to where MORGAN and the examples # are installed on your system. # This is the only change you need to make in this file. MORGANDIR = /h4/audrey/MORGAN_V2.8.1 EXAMPLEDIR = /h4/audrey/MORGAN_Examples BINDIR = /h4/audrey/bin # Note: the paths may happen to be same for MORGANDIR and EXAMPLEDIR. # In general they are different: # MORGANDIR is where MORGAN is installed on your system # EXAMPLEDIR is the MORGAN_Examples directory you have made # BINDIR is your bin directory # BINDIR is needed only if you prefer to link to executables from your # bin directory, rather than running from a current directory: for # example, if your current directory is not in your standard path. |
For more information on how to use Makefile to build links, etc., you may type:
make help |
To make symbolic links to those programs in the current directory, type
make links |
Notes for Microsoft Windows users:
MORGAN may be (in principle) installed under Windows: executables should then be placed in the directory in which programs are to be run. See the documentation for more information. We cannot currently answer any questions regarding Windows installation. Instead, we recommend the use of a linux-system emulator such as Cygwin .
Except for some small pedagogical pedigrees for pedcheck under
`Pedcheck', two main pedigree files
are used to illustrate the usage of MORGAM programs.
File `jv_rep.ped', located under `IBD', is composed of two replicates of the JV pedigree. The 30-individual 5-generation JV pedigree derives from a real study of a rare recessive trait by Goddard et al (1996 AJHG 58: 1286-1302).
The other pedigree in `ped73.ped', located under `MORGAN_Examples', consists of three components and 73 individuals: component one has 47 individuals from 6 generations, component two 11 individualus from 3 generations, and component three 15 individuals from 3 generations. In general, individuals from later generations are observed. The three components are displayed in `ped47.pdf', `ped11.pdf' and `ped15.pdf'.
Only functions affecting general MORGAN
capabilities or enabling new
programs are listed here. Many other general improvements in
Morgan library functions and MORGAN
set-up routines have been made.
Some details may be found in MORGAN
web release notes, or in the
README_relnotes of each MORGAN
release.
In MORGAN V2.8.2, forward HMM computation for multiple meioses has been replaced by a factored version (FHMM), enabling much faster exact computation on small pedigree components and multiple-meiosis sampling for larger numbers of meioses.
Exact computation of lodscores on small pedigree components
has been implemented for lm_markers: computation uses the FHMM
version of the Baum algorithm.
MM-sampler updates multiple meioses jointly and is therefore a generalization of the meiosis sampler (M-sampler). There are four types of update in MM-sampler: random meiosis update, individual update, sib update and 3-generation update. This is based on work by Liping Tong.
For more detail on L- and M-samplers, see Using MCMC to Estimate Parameters of Interest in Pedigree Data.
Up to MORGAN V2.8.2, MCMC was performed globally over pedigree components (except those small enough for exact computation). The L-sampler peeling and lod score estimation could be done either by component (using "set peeling by component") or globally (the default).
With MORGAN V2.8.3, and specifically to accommodate the new
lm_haplotype program, the preferred option is to do both MCMC and
pedigree peeling (lod score estimation) by component, and to use exact
computation on all sufficiently small component pedigrees. The
alternative, retained so that older data sets can be rerun, is to
use "set global MCMC", in which case no exact computation will be
done, and MCMC will be done globally over all component pedigrees.
In this case, the "set peeling by component" option is retained.
As yet, loci are either multiallelic marker loci assumed observed
without error, or trait loci which may have general penetrance functions
but are diallelic. In order to allow models for "non-genotypic"
markers, general joint peeling programs have been implemented, based
on Thompson (1976: UU Tech Rept, #6). These peeling routines are used
by the lm_map program which allows for errors in marker data. They are
not yet released, as they are still in process of testing: they may be
released in MORGAN V2.8.3.
In MORGAN V2.8.2,
liability penetrances (previously available only for lm_bayes)
have been implemented for lm_markers. Penetrances for each
liability class are now read from an input file using the
"input extra data file S" parameter statement.
Additionally, an age-based penetrance function for a qualititative trait has been implemented. That is, penetrances are directly dependent on age, rather than going through a liability class specification.
In two-locus models for a quantitative trait, penetrances may be specified as additive, with a genotypic mean for each trait genotype for each locus. Alternatively, a matrix array of 2-locus genotypic means may be specified, allowing for epistasis (see Sung & Wijsman, 2007, Human Heredity 63: 144-152.).
With more complex trait models, including those of
lm_twoqtl
(see Sung et al., 2007, Genetic Epidemiology 31: 103-114), a more general
specification of traits is required. In MORGAN V3.0,
completely new structures have been introduced, separating traits
(phenotypes) from trait loci ("tlocs"). Traits may be affected by
genotypes at several tlocs; the genotypes at a tloc may affect several
traits. This more general structure is not yet released - see
lm_twoqtl below.
lm_map with and without error models:
Some updates and corrections to the lm_map program are made
in MORGAN V2.8.2. Additonally under development is a version
of lm_map which allows for error in observation of marker
genotypes. This version may be released with MORGAN V2.8.3.
The version of the program lm_pval released in
MORGAN V.2.8 and subsequent, and described in this tutorial,
uses the latent p-value distribution of Thompson & Geyer (2007,
Biometrika). Additional programs using these ideas are under
development, including programs for the distribution of latent lod
scores obtained in MCMC sampling (lm_fuzlod),
p-values and randomized tests
based on latent lod score statistics (lm_fzplod),
and randomized confidence sets
for the location of a trait locus (lm_fzconf). These are working
names only. The methods are described in Thompson (2007: submitted).
In MORGAN V2.8.3, Gold replaces the previous
Gold2
subdirectory, for tests of
lm_auto, lm_pval,
lm_map and lm_ibdtests.
Gold1 lm_auto tests remain temporarily,
since they provides the only tests of MCMC samplers on looped
pedigrees. Gold1 lm_auto
gold standards were omitted from the released MORGAN
V2.8.2, due to delays in checking looped pedigree peeling routines:
they will be reinstated in MORGAN V2.8.3.
To make room for new Lodscore programs being released in
MORGAN V2.8.2, 2.8.3 and 3.0, the two older programs
lm_schnell and lm_lods have been moved to the
new directory LR_Lods. These two programs differ in several ways
from newer programs, but the principal one is that they use the
methods of combining likelihood ratios (LR) along the chromosome
in order to estimate lod scores (see Thompson & Guo, 1991, IMA J
Math Appl in Med & Biol).
The Gold1 subdirectory remains temporarily,
since it provides the only tests of MCMC samplers on looped
pedigrees. Gold1 lm_lods
gold standards were omitted from the released MORGAN
V2.8.2, due to delays in checking looped pedigree peeling routines:
they will be reinstated in MORGAN V2.8.3.
lm_multiple and lm_markers
A new lod score calculation program lm_multiple was released in
MORGAN V2.8.2 (Spring, 2006). The lm_markers program
is still made as a separate executable, but is compiled as a special
case of lm_multiple code. In V2.8.2,
both programs perform exact lodscore
computations on small pedigree components. In V2.8.3 (unreleased) this
is optional: see lm_haplotype below.
The difference between lm_multiple and lm_markers is that
the new lm_multiple substitutes the old single-meoisis M-sampler
updates for the new multiple-meiosis (MM) sampler that is
the work of Liping Tong (Tong & Thompson, 2007, Human Heredity:in
press): see above.
lm_multiple runs with the same parameter file and other input
files required
by lm_markers. The output is also essentially the same as that from
lm_markers.
More information on lod score calculation programs in MORGAN V.2.8.1 (and previous) can be found in Estimating Location LOD Scores by MCMC.
lm_haplotype
The lm_haplotype program is a generalization of
lm_multiple in which haplotypes of key individuals dividing the
pedigree are sampled in addition to meiosis indicators. To facilitate
efficient implementation of this algorithm, new peeling-by-component
routines need to be implemented and checked. This program is the work
of Liping Tong (Tong & Thompson, 2007, Human Heredity:in
press). The program is in process of release: probably in
MORGAN V2.8.3.
In MORGAN V2.8.2,
Gold standards for exact computation and for lm_multiple
are added in the Lodscore/Gold2 subdirectory.
In MORGAN
V2.8.3, the Gold directory
replaces the previous Gold2 directory. Gold directories for
lm_lods
and lm_schnell are moved to the new LR_Lods program
directory. Thus Gold1 no longer exists in Lodscore.
lm_twoqtl
The new program lm_twoqtl allows two (linked or unlinked)
quantitative trait loci to contribute additively or epistatistically
to a single trait (see Sung et al., 2007, Genetic Epidemiology
31: 103-114).
An unreleased version of this program exists under MORGAN
V2.8, but a user-friendly version requires a new MORGAN
structure in which traits are separated from trait loci or "tlocs": see
above. This new structure is implemented in
MORGAN V3.0. The program lm_twoqtl will be the first
to be released under this structure, although eventually all current
programs will be converted.
All MORGAN programs use the same command line syntax, share many statements, and use the same pedigree data format. Most of the MORGAN programs need at least two input files in order to run: one parameter file and one pedigree data file. The parameter file contains computing requests, model parameters and input/output file options. It may also contain genotype data or other information specific to a particular MORGAN program. The pedigree file contains, at minimum, information on family relationships among the individuals in the sample.
It is worth pointing out that white space in any input file is defined to be any of these characters: `,' (comma), `\t' (horizontal tab), `\v' (vertical tab), `\n' (line feed, or newline), `\f' (form-feed), `\r' (carriage return).
| 2.1 Command syntax | ||
| 2.2 Parameter file | ||
| 2.3 File identification statements | ||
| 2.4 Pedigree file | ||
| 2.5 Pedigree file description statements |
The parameter file name must be passed to MORGAN on the command line when calling the program. Other file names can be passed to MORGAN on the command line or in the parameter file. The minimum syntax to call a MORGAN program is:
progname parfile |
In the statement above, progname is the name of one of several
MORGAN main programs, such as genedrop or
lm_lods. The parfile is the name of the parameter file
which must be present. For example, to run genedrop using a parameter file named `genedrop.par', the command is:
|
Additional file names can be passed to MORGAN on the command line, but these file names must be accompanied by a file type to identify them. The syntax is:
progname parfile [filetype filename]... |
Square brackets indicate optional arguments. Possible filetype options include:
pedInput pedigree file
markInput marker data file (Note that only Autozyg programs use marker data)
opedOutput pedigree file
seedInput seeds for random number generator
oseedOutput random seeds
oscorOutput score file
If the name for a particular file type is given both in the command line and in a parameter statement, the name in the command line takes precedence.
The programs put informational messages to stdout and error
messages to stderr which default to the screen. It is possible to
redirect either or both to a named file.
A MORGAN parameter file contains a series of statements. Many statements are common to all MORGAN programs, particularly those that define the format of the pedigree file and identify other files to be used for program input or output. Many statements are optional, with some default behavior. If statements irrelevant to the MORGAN program called by the user are included in the parameter file, those statements are ignored and a warning message is issued.
Each statement must begin on a new line and begins with one of the MORGAN statement keywords. A statement consists of any number of lines. Case is not significant for the keywords. Only the first four letters of the keywords are significant; the remainder of the word is ignored. The order of the statements does not matter. If the same statement is repeated, the last one overrides previous ones and a warning is given in the output file. A # starts a comment so that the rest of the line is ignored. Either single or double quotation marks (' or ") can be used to delimit strings such as file names. Look at the warnings issued by MORGAN to make sure the parameters are as you intended.
The most common statements are for identifying input and output files (counterparts of the command line options) and for describing the input pedigree file format.
Below is a simple parameter file, `simple.par', from the examples included with the MORGAN software.
input pedigree file `simple.ped' output pedigree file `simple.out' output pedigree chronological |
A brief description of the most commonly used parameter file statement follows in the next section. For a complete and more detailed description of MORGAN statements, please see the sections of this tutorial relevant to specific MORGAN programs and the documentation that comes with MORGAN in the files `README_userdoc' in the various program subdirectories.
Within the parameter file, file names are delimited with single or double quotation marks (` or "). File names submitted on the command line are not delimited with quotation marks. In a parameter file, either of the two statements below would identify `pedchk.ped' as the pedigree file to be read.
input pedigree file "pedchk.ped" input pedigree file `pedchk.ped' |
The most commonly used file identification statements are:
input pedigree file filenameThe input pedigree file is required for most programs and may be specified either in the parameter file or through command line options.
output [overwrite] pedigree file filenameThe output pedigree file is required by genedrop. Other programs
also check for errors in the pedigree. If there are errors that
the program is able to correct or if there are requested changes to the
pedigree file
format, the new pedigree data is written to this file.
input marker data file filenameMarker data, such as marker allele frequencies, map distances between markers and individuals' genotypes, can be included in the parameter file itself or in a separate file, called the marker data file. This statement is used when the marker data are not included within the parameter file. The marker file contains the `set marker data' statements. Marker data are used by Autozyg programs. See section Autozyg computational parameters.
input seed file filenameThis file contains statements to set random seeds for the Monte Carlo based programs. The seed file may contain multiple lines (as in the case when the input seed file is also used for the output seed file). If so, the seeds in the last line override previous ones (with warnings issued). If no seed file is named on the command line or in a parameter statement and there are no statements to set random seeds in the parameter file, default seeds (12345, 1073 (hexadecimal 0x3039, 0x431)) are used.
output [overwrite] seed file filenameThe final random seeds are saved if an output seed file is named. This file could be the same as the input seed file. New entries are appended to the old file.
The pedigree file may contain two sections, formatting statements and pedigree data, separated by the file separator `****'. The first section is optional; if present, it contains statements that describe the contents and format of the pedigree file, as some MORGAN users find it convenient to describe the pedigree data within the file itself. The alternative is to put these formatting statements in the parameter file.
The pedigree data begin below the file separator. Data for each individual must be placed on a separate line. Each line begins with three names, followed by integers, then real numbers. The only required fields, the three "names", are identifiers for each individual and his or her parents. Names may include up to 15 alphanumeric characters. Whitespace (comma, space, tabs, linefeed), single (') and double (") quotes, and the hash mark (#) cannot be included in names. Names longer than 15 characters are truncated to 15 characters. Pedigree founders should be given parents with names `0'.
Gender, if present, is the fourth item in each line. These three or four values may be followed by an "observed" indicator, with values of `0', indicating an unobserved individual, or `1', indicating an observed individual. The optional "observed" indicator is followed by other integers, if present, and real numbers, if present. Integers and real numbers can represent individuals' trait data.
The format of the file is flexible and is specified by the user with `input pedigree record …' statements, described in the next section.
Unlike LINKAGE format pedigree files, genotype data are not included in a MORGAN pedigree file.
Any of the following statements can be placed either in the parameter file or in the top section of the pedigree file, above the file separator, `****'. Most parameters have default values, in which case the statement is usually not required.
allow pedigree size NThis statement overrides the program-defined maximum pedigree size (presently 20,000 individuals).
input pedigree size NHere, N is the number of records to be read. It may be less than the actual number of individuals in the pedigree file.
input pedigree record names 3 [integers I] [reals J]This specifies the numbers of entries in each line of the pedigree file. There must be three names (up to 12 alphanumeric characters each) identifying an individual and his or her parents. Optional integers include gender and phenotypic or discrete trait data. Real numbers could be covariates or quantitative trait values.
input pedigree record (father mother | mother father)This statement specifies the order of parental names. `father mother' is the default.
input pedigree record gender (present | absent)Gender, which follows the required triplet of names, is optional. If this statement is not included, the default is `gender present'. Gender is coded as an integer, such that `1', `2' and `0' represent male, female, and unsexed, respectively.
input pedigree record observed (absent | present) The observed indicator designates which members of the pedigree are observed and which are unobserved, indicated by `1' and `0', respectively. When the observed indicator is present, it follows gender (or parents if gender is not present). If this statement is absent, all pedigree members are assumed to be observed.
input pedigree record traits K1 K2... integers I1 I2...This statement is needed when integer data for traits are included, and the trait values do not immediately and consecutively follow gender (if present). Use this statement to specify the correspondence between trait numbers and integers in the record.
Below are the first several lines of the sample pedigree file, `ped73.ped' in `MORGAN_Examples'.
input pedigree size 73 input pedigree record names 3 integers 6 reals 1 *************************************************** 101 0 0 1 0 0 0 -1 -1 999.5 102 0 0 2 0 0 0 -1 -1 999.5 201 101 102 1 0 0 0 0 1 999.5 |
Note that genotype data are not contained in the pedigree file; genotype data, if required for the MORGAN program invoked, are contained in the parameter file or in a marker data file specified in the parameter file using the `input marker data file' statement. The second parameter statement in the file, `input pedigree record names 3 integers 9 reals 1' describes the format of the data on each line (also called a record) in the file. The first three values in each row, the names, give an individual's identification number followed by those of his or her father, then mother. Because there is no `input pedigree record gender' statement, gender is assumed to be present and to directly follow the three names. Absence of an `input pedigree record observed' statement means that all individuals are assumed to be observed. The 8 integers following gender and the real number in the final column represent trait data. Lack of an `input pedigree record traits integers' statement indicates that the first integer following gender corresponds to trait 1, the second to trait 2, etc. These (and other) parameter statement defaults apply only if there is no overriding statement in any of the parameter files used. Programs will generally provide a warning statement (coded "(W)") when default values are being used due to absence of a relevant parameter statement.
3.1 Introduction to pedcheck | ||
3.2 Sample pedcheck parameter file | ||
3.3 Running pedcheck examples | ||
3.4 pedcheck statements |
pedcheck pedcheck reads the pedigree file and checks for errors in the
pedigree structure. Specifically, it checks for the following errors:
If no errors are found, pedcheck reports the number of components
(connected pedigrees) found and lists for each component:
If there are changes to the file, pedcheck writes an output
pedigree file. Requested changes may include reordering of the pedigree
chronologically (by component, then by name), the addition of gender, the addition of an observed
indicator, and reversing the order of the parental names.
Other MORGAN programs do their own pedigree checking by
calling the relevant pedcheck functions, but it is still useful
to do preliminary processing of data files first.
pedcheck parameter file Files for pedcheck may be found in the `Pedcheck'
subdirectory of
`MORGAN_Examples'.
Below is the sample parameter file `check.par' for pedcheck:
input pedigree file `check.ped' input pedigree size 30 input pedigree record gender absent input pedigree record observed present assign gender output pedigree chronological output pedigree file `check.oped' |
The `assign gender' statement requests that pedcheck
determine gender, when possible, and output that information to the
output pedigree file. The gender determination is made based on the
default order for the listing of parents, which is father followed by
mother. Notice that individuals who are not parents (531 and 541) have
missing gender, `0', in the fourth column of `check.oped'.
`output pedigree chronological' causes the pedigree to be sorted
into chronological order in the output pedigree file, first by
component, then by individual name. MORGAN refers to each
connected pedigree (i.e., distinct family) in a file as a component.
The first individual in the input listing who is not genealogically
connected to individual 1 defines component 2, and the first who is
not connected to either of these defines component 3, etc.
Although pedcheck groups individuals by their
MORGAN-assigned component numbers in the output pedigree
file, it does not list the component numbers. That is, the first three
columns of the output file are just as they were in the input file:
individual name, father's name, mother's name.
pedcheck examples Examples for the program pedcheck are under the subdirectory
`Pedcheck/'. The commands using example files are listed below.
Have a look inside the pedigree and parameter files, then verify that the
output files are as you would expect them to be. If error messages are
generated, verify that they make sense and see if you can make the
necessary corrections so that pedcheck will run.
runs on input pedigree file `check.ped'. The pedigree contains no errors, but has no gender specified and is not in chronological order. These deviations from the standard format must be specified in the parameter file. The output pedigree file `check.oped' has gender assigned and has the members reordered. You will get an error message and the program will quit if `check.oped' already exists. If this occurs, delete the file and try again or use another output file name.
runs on input pedigree file `imp.ped'. The pedigree contains an individual who is his own ancestor.
runs with an empty parameter file, with input pedigree file `sex.ped' specified on the command line. What does the output say is wrong with this pedigree?
runs with an empty parameter file, with input pedigree file `dup.ped' specified on the command line. What does the output say is wrong with this pedigree?
pedcheck statements pedcheck statements apply to other MORGAN programs since
the programs call the pedcheck functions first to check the
pedigree file before doing computations on the pedigree data.
(assign | ignore) genderOptional. `assign gender' causes gender to be determined by parentage, whether or not gender is included in the pedigree file. `ignore gender', causes the program to not check or assign gender. The default action is to assign gender when it is absent and to check gender if it is present.
output pedigree chronological Optional. If this statement is present and if the input file is not in chronological order, the pedigree is sorted and written out in chronological order. The pedigree is sorted by components, and within each component, each non-founding member is preceded by her or his parents. If this statement is not given, the input order is preserved in the output file, if written. See the previous section of this chapter for further discussion of pedigree components.
output pedigree record (father mother | mother father)Optional. This statement causes the parents to be named in the specified order. The default arrangement for each triplet of names is the input order.
4.1 Introduction to kin | ||
4.2 Sample kin parameter file | ||
4.3 Running kin example and sample output | ||
4.4 kin statements |
kin kin computes kinship coefficients for pairs of pedigree members.
It also computes single-locus and two-locus inbreeding coefficients for
members of the pedigree. Briefly, the kinship coefficient between
individuals i and j is the probability that a randomly-drawn
allele from i is identical by descent (ibd) to randomly-drawn
allele from individual j at the same locus. A single-locus
inbreeding coefficient is the probability that an individual carries two
copies of a gene that are ibd, at a given autosomal locus. In other
words, an individual's single-locus inbreeding coefficient is equal to
the kinship coefficient of his parents, as an individual's gametes can
be thought of as random draws from his parents' chromosomes. A
two-locus inbreeding coefficient is the probability that an individual
carries two ibd copies of a gene at each of two linked loci.
kin presents two-locus inbreeding coefficients as a function of
the recombination fraction between the two loci.
kin parameter file Files for kin may be found in the `IBD' subdirectory of
`MORGAN_Examples'.
Below is a sample kin parameter file, `jv_rep_kin.par'.
input pedigree file 'jv_rep.ped' compute component 1 kinship coeff 531 431 431 432 compute component 2 inbreeding coeff 441 541 compute component 1 two-locus inbreed coeff 531 compute component 2 two-locus inbreed coeff 541 set recomb freqs .01 .04 .05 .10 .18 .30 .50 .0 |
The statements on lines 2 - 5 request computation of kinship coefficients for the pairs `531 431' and `431 432' from component 1, inbreeding coefficients for individuals `441' and `541' from component 2, and the two-locus inbreeding coefficient for `531' from component 1 and `541' from component 2. The two-locus inbreeding coefficient will be computed for two loci at distances specified in the `set recomb freqs' statement.
kin example and sample output Under the subdirectory `IBD/', run the example above using the command below. To send the output to a file instead of the screen, include `> filename' (without quotes) after the parameter file name on the command line.
kin jv_rep_kin.par |
Below is the relevant part of the kin output.
Component 1:
Kinship coefficients:
531 431 .32031
431 432 .10938
2-locus inbreeding coefficients:
(g4link is probability of IBD at both of 2 linked loci)
proband recomb g4link
freq prob
531 .000 .10938
.010 .10234
.040 .08386
.050 .07849
.100 .05660
.180 .03455
.300 .01910
.500 .01196
Component 2:
Inbreeding coefficients:
441 .06250
541 .10938
2-locus inbreeding coefficients:
(g4link is probability of IBD at both of 2 linked loci)
proband recomb g4link
freq prob
541 .000 .10938
.010 .10234
.040 .08386
.050 .07849
.100 .05660
.180 .03455
.300 .01910
.500 .01196
|
Note that when the recombination frequency is 0.0, the two-locus inbreeding coefficient is the same as the one-locus inbreeding coefficient, as there is no recombination between the loci, thus they act as a single locus. When the recombination frequency is 0.5, the two loci are independent and the two-locus inbreeding coefficient is the squared one-locus inbreeding coefficient.
kin statements At least one of the following `compute ...' statements are required
to run program kin. If there is more than one component
(connected pedigree) in the file, the component number must be
specified. MORGAN assigns component numbers to the connected
pedigrees within the pedigree file. If your data set contains more than
one component, you may first run pedcheck to determine which
individuals are assigned which component numbers. pedcheck will
sort by component number in the output pedigree file, although it will
not list component numbers in the file. The screen output generated
when running pedcheck will give component nubmers and the number
of individuals in each component. Check the error
and warning messages when running kin to verify that component
numbers were correctly specified. The program will quit if an
individual's component number is incorrectly specified in the parameter
file or if there is more than one component in the data set and no
component is specified.
compute [component M] kinship coefficient N1 N2...This statement names one or more pairs of pedigree members for which the kinship coefficient is to be computed.
compute [component M] inbreeding coefficient N1...This statement names one or more pedigree members for whom the inbreeding coefficient is to be computed.
compute [component M] two-locus inbreeding coefficient N1...This statement requests the computation of two-locus inbreeding coefficients, i.e. the probability of ibd at both loci, for the named individual. For the recombination frequencies at which the coefficients are computed, see the following statement.
set recombination frequencies X1 X2...Two-locus inbreeding coefficients are computed for each of the list of recombination frequencies, in the range of 0.0 to 0.5. If frequencies are not given, the default values are: 0.0, 0.01, 0.04, 0.05, 0.10, 0.18, 0.30, and 0.50.
5.1 Introduction to genedrop. | ||
5.2 Sample genedrop parameter file | ||
5.3 Running genedrop examples and sample output | ||
5.4 genedrop statements |
genedrop. genedrop simulates pedigree data for analysis by other programs.
Given a genetic map, it simulates genotypes at marker loci (linked or
unlinked) and the discrete genotypes and polygenic values contributing
to quantitative traits. The trait loci may or may not be linked to
marker maps. Thus, one or more of three kinds of loci are simulated on a
chromosome: markers, traits linked to markers, and traits not linked to markers.
Using a random number generator, genedrop first assigns marker
and trait genotypes and polygenic trait values to the founders. Then
meiosis indicators are simulated for non-founders in chronological
order, thus determining the genes inherited and the founder labels. The
order of simulation of markers and traits for each individual is: first,
the marker genes, if any, are simulated, in the order mapped on the
chromosome. Then linked traits, if any, are simulated in map order.
Finally, unlinked traits, if any, are simulated.
Because founders of a pedigree are assumed to be unrelated, a unique identifier, a founder gene label or founder label, is assigned to each of the two haploid genomes of each founder. The user may choose to include these founder labels with the output pedigree, identifying the ancestral source of each gene at each locus in non-founders.
The user may provide random number seeds for both the marker simulation and the trait simulation. This permits multiple simulations, for a pedigree, of identical marker genotypes, but with different quantitative trait values.
The population and segregation model parameters (allele frequencies, trait genotype means, additive and residual variances) may be specified by the user and take default values if not specified. Several different trait models can be specified as in the following table:
| Equal Genotypic Means | Zero Additive Variance |
non-genetic model | YES | YES |
polygenic model | YES | NO |
major gene model | NO | YES |
mixed model | NO | NO |
The trait locus must be diallelic and the trait residual variance must be greater than zero. A very small residual variance can be specified if one desires to simulate a qualitative trait.
Genetic data on all individuals may be included in the simulated pedigree, or some individuals may be specified as "missing". If any individuals are to be missing genetic data, an "observed" indicator column must be included in the pedigree file. See section Pedigree file, for details.
genedrop parameter file Files for genedrop may be found in the `Simulation' subdirectory of
`MORGAN_Examples'.
The example here refers to `ped73_gdrop.par'.
The seed file can be accessed as through the command line or in the
parameter file. If from the parameter file following statements are needed:
input seed file 'marker.seed' output marker seeds only output overwrite seed file 'marker.seed' |
The above statements specify `marker.seed' as both the input and output seed file. The statement `output marker seeds only' causes only the marker seeds to be saved before exiting the program. The default is to save both marker and trait seeds. Also, the `overwrite' option in line 4 enables the program to replace the current content in the seed file with the newly generated random numbers, which may be used for simulation in the future. The seed file contains one or more statements like `set marker seeds 0xde5e8d39 0x8ec584dd'. In the example, we have chosen to access the seed file from the command line, hence the first and third lines in the above example are commented out in `ped73_gdrop.par'. See the command line used to simulate data in the next section.
With the statement `output pedigree chronological', the output pedigree will be in chronological order, a requirement for other MORGAN programs to run properly. In addition, a file name must be specified for the output pedigree file. This can be done here in the parameter file using an `output pedigree file' statement or it can be specified on the command line, as shown in the next section.
simulate chrom 1 markers 10 traits 1 |
The above statement asks genedrop to simulate ten markers loci and
one trait locus on chromosome 1. If no trait locus is to be simulated,
the part `traits 1' can be removed.
map chrom 1 marker dist 10 10 10 10 10 10 10 10 10 |
The above statements indicates marker maps on chromosome 1, as specified by map distances (in Haldane centiMorgans). They can also be specified by recombination fractions, e.g. (does not correspond to the above map distances):
map chrom 1 marker recomb fracs 0.1 0.5 0.2 |
Marker allele frequencies are set by the following lines:
set chrom 1 markers 1 freqs 0.13 0.66 0.16 0.05 set chrom 1 markers 2 freqs 0.06 0.23 0.41 0.25 0.05 set chrom 1 markers 3 freqs 0.11 0.02 0.01 0.06 0.24 0.56 set chrom 1 markers 4 freqs 0.07 0.04 0.89 set chrom 1 markers 5 freqs 0.12 0.11 0.03 0.03 0.50 0.21 set chrom 1 markers 6 freqs 0.50 0.44 0.06 set chrom 1 markers 7 freqs 0.01 0.33 0.62 0.04 set chrom 1 markers 8 freqs 0.20 0.05 0.42 0.27 0.06 set chrom 1 markers 9 freqs 0.18 0.18 0.25 0.16 0.08 0.15 set chrom 1 markers 10 freqs 0.17 0.35 0.04 0.29 0.15 |
In the case where several markers have the same number of alleles and allele frequencies, one can group those markers together into one line such as
set chrom 1 markers 11 12 13 15 freqs 0.2 0.8 |
although we consider it good practice to specify the frequencies separately for each marker.
The following five lines describe the trait locus. The trait locus is between markers 2 and 3 on chromosome 5, with recombination fraction 0.1127 to marker 2. The trait locus can have only two alleles; here the frequencies are 0.95 and 0.05, for alleles 1 and 2, respectively. The mean values of the trait for each trait locus genotype are on the next line. Values correspond to the (1 1), (1 2) and (2 2) genotypes, respectively. A non-zero residual variance indicates that there is an environmental component to the trait.
map chrom 1 trait 1 marker 5 dist 5 set trait 1 freqs 0.5 0.5 set trait 1 geno means 90 100 110 set trait 1 residual variance 25.0 set trait 1 additive variance 0.0 |
The final three lines of the parameter file request that founder gene (or genome) labels and latent variable values for the trait be included in the output file, and data be simulated for all (observed and unobserved) individuals. Founder gene labels indicate, for all non-founders, which founder alleles were passed to the individual. The latent trait variables mentioned in the final statement are the trait locus genotypes and the additive and residual components of the trait value. They will precede the trait value in the output file.
output pedigree record founder gene labels output pedigree record trait latent variables output pedigree record unobserved variables |
genedrop examples and sample output Two examples are available under the subdirectory `Simulation/'. The commands to run these examples are similar to the following:
genedrop ped73_gdrop.par ped ../ped73.ped seed ../marker.seed oped ../ped73_gdrop_simu.oped |
Upon running the genedrop example, notice (but do not worry
about) the appearance of a warning message on the screen:
`Overrides previous statement of same type (W)'. The warning was
triggered by the specification of `seeds' as both the input and
output seed file, without inclusion of `overwrite' in the
`output seed file' statement. When an overwrite is not requested,
MORGAN appends the new output seeds to the existing file at
the end of the run. Thus, at the next run, more than one `set
marker seeds' statement exists in the seed file. The program uses only
the last `set marker seeds' statement in the file and issues the
warning to indicate this to the user. To avoid this
warning (and an ever-growing seed file), one can access the seed file
from the parameter file as at the beginning of the previous section
Sample genedrop parameter file, using the `overwrite' option
when outputting the seeds.
Unlike other MORGAN programs, genedrop always creates
an output pedigree file, as the function of the program is to simulate
marker and trait data. The output file, `ped73_gdrop_simu.dat' here,
has the same structure as the input file `ped73.ped', with one
individual per record (line), but the output
file does not include the parameter
statements at the top of the input file and contains additional columns.
The first four items are
the individual's name, the names of the parents and gender. If no addition
output options are set, the next items are the genotypes of the markers (two
items per marker) in their order on the chromosomes, and the trait
values in the order of trait labels.
Since `output pedigree record trait latent variables' was included in the parameter file, the output file contains four additional columns preceding the trait value. They are: trait locus genotype (2 columns), additive component of the trait value and residual component of the trait value. In this example, everyone has a `0.000' in the additive component column, as we set the additive variance to zero in the parameter file.
Because `output pedigree record founder gene labels' is set, the founder gene lables (FGLs) for markers precede the marker genotypes and the trait FGLs precede the trait locus genotypes (only if latent variables are requested, as in this example).
Also, the `output pedigree record unobserved variables' statement is included in `gdrop.par', so an observed indicator would follow gender in the output pedigree file.
genedrop statements genedrop computing requests simulate [chromosome I] markers J [traits K]One statement is given for each chromosome on which markers or both markers and traits are to be simulated. Only unlinked traits are simulated if no such statement is given. The `chromosome' keyword can be omitted if all markers and linked traits are on the same chromosome. The markers and traits are labeled as positive integers. The marker labels are relative to each chromosome whereas the trait labels are absolute. That is, the markers on chromosome 5 are labeled 1, 2, …, and so are the markers on chromosome 7. The traits however are labeled, for example, as 1 on chromosome 5 and as 2 on chromosome 7. There can be at most 10 traits and 500 markers.
simulate unlinked traits K1... Optional. This statement specifies traits to be simulated which are not linked to markers, and hence, have no map specifications.
genedrop mapping model parameters map [chromosome I] [gender (F | M)] marker ( [Kosambi] distances | recombination fractions | [Kosambi] positions) X1 X2...This statement is required if simulation of more than one marker is requested. One statement is used per chromosome. This statement specifies the marker map or positions given in units of genetic distances (cM), or recombination fractions between markers. Marker map or positions can be sex-specific if gender is included in the statement. If `distances' is chosen, intermarker distances must be provided such that the number of distances is one less than the number of markers. If `positions' is chosen, the number of positions must equal the number of markers, as these are absolute positions relative to a zero point to the left of all of the markers. The Haldane mapping function is used to convert between the genetic distances and recombination fractions unless Kosambi is specified.
map [chromosome I] [gender (F | M)] traits K1 K2 ... markers J1 J2 ... ( [Kosambi] distances | recombination fractions) X1 X2...This statement is required if simulated traits are to be linked to markers. That is, it is not required if no traits or only unlinked traits are to be simulated. The statement specifies the location of each trait locus with respect to one of the marker loci. Thus, the number of traits listed in the statement must be equal to the number of markers listed and to the number of distances (or recombination fractions) listed. The trait locus will follow the corresponding marker locus (to the right, so to speak) at the distance specified. To simulate a trait locus that precedes all marker loci, list marker `0' in the statement. For example, with `map trait 3 2 marker 6 0 distances 5 4', traits 3 and 2 will be placed 5 cM to the right of marker 6 and 4 cM to the left of marker 1, respectively.
genedrop population model parameters set [chromosome I] markers K1... frequencies X1 X2...This statement specifies markers allele frequencies. Allele frequencies for a marker should sum to between 0.9999 and 1.0001. Otherwise they are normalized. Multiple markers can be specified in a single statement if they reside on the same chromosome and have the same number of alleles with the same allele frequencies.
set traits K1... frequencies X1 X2This statement specifies the trait allele frequencies. Allele frequencies for a trait should sum to between 0.9999 and 1.0001. Otherwise they are normalized. Multiple trait loci can be specified in a single statement if they have the same allele frequencies. Trait loci must be biallelic.
set normalized frequencies If the set of allele frequencies for each marker and trait is to be normalized, this statement is given. Normalization of the frequencies is recommended when simulating pedigree data, but not recommended when using the other programs.
set traits K1 ... genotype means X1 X2 X3Since two alleles are simulated for each trait locus, three means must be specified for the polygenic trait values: one each for the (1 1), the (1 2) or (2 1), and the (2 2) genotypes. The default values 0.0, 0.0, and 0.0.
set traits K1 ... additive variance XHere we specify the genetic variance for one or more trait. One of there statements is given for each value assigned. The default variance is 0.0.
set traits K1 ... residual variance XThis statement is like the preceding one. The environmental contribution to the trait is set using this statement.
genedrop computational parameters set marker seeds H1 H2This statement initializes the seeds for the random number generator in the gene dropping algorithms. The seeds are to be positive and no greater than hexadecimal 0xFFFFFFFF, with the first seed (congruential seed) odd, and the second seed (Tausworthe seed) nonzero. If no seeds are specified for marker simulation, default seeds are used.
set trait seeds H1 H2This statement initializes the seeds for trait simulation. If no seeds are given, the starting seeds for trait simulation are the seeds returned by the random number generator at completion of marker simulation.
genedrop output pedigree options output pedigree record founder gene labels When this option is selected, each record contains, for each locus, a pair of founder gene labels. Each founder is assigned a pair of labels, which are in the same order as the names of the parents. Then, for each locus for each descendant, founder labels are determined by the simulated meiosis indicators.
This statement is useful in cases where we want the "disease" allele in the pedigree to come from a particular founder.
output pedigree record trait latent variablesThis statement requests that the quantitative trait value be output for each trait. With this option, additional items appear in the record: the genotype at each trait locus and the additive and residual component of each quantitative trait.
output pedigree record unobserved variablesWhen this option is not selected, unobserved individuals have the genotype at each locus represented as `0 0', the founder gene label (if requested) at each locus represented as `0 0', and each quantitative trait value given as `999'.
If this option is set, then genotypes, gene labels and trait values are output for unobserved individuals, as well as for the observed members of the pedigree. An additional data field, following the gender indicator, specifies whether or not the individual is observed. `0' indicates unobserved, `1', observed.
input pedigree record observed (absent | present)The observed indicator is used to designate which members are observed, with '0' indicating unobserved, '1' indicating observed. When the observed indicator is present in the pedigree file, it follows gender (or parents, if gender is not present). If this statement is not given, all pedigree members are assumed to be observed. See also the next statement `assume all observed'.
If individuals are flagged, in the pedigree file, as unobserved, the default is to indicate, in the output pedigree file, the data for these individuals as missing.
assume all observedThis statement results in all members of the pedigree "observed" in the simulation, with the observed indicator, if present in the input file, ignored.
genedrop output seed file options output (marker | trait) seeds onlyIf an output seed file is given, both ending marker and trait seeds are saved unless one or the other is requested in this statement.
markerdrop markerdrop simulates marker data at markers linked to a
potential trait locus. The user must specify whether marker data
simulation is to be conditional on a trait model or on an inheritance
pattern at the trait locus. The choice of a trait model or an
inheritance pattern will dictate which additional parameter statements
must (or may) be included in the parameter file.
markerdrop statements. There
must be only one mapping statement for the trait; from this statement
the trait number (name) is deduced. Phenotypic trait data are
provided as affection status of each individual in the pedigree file.
An inheritance pattern at the trait locus is simulated from the trait
data; this becomes the trait model on which markers are simulated.
markerdrop computing requests and Using MCMC to Estimate Parameters of Interest in Pedigree Data. Location of
inheritance indicators in the pedigree file can be specified using the
`input pedigree record' statement. Again, specification of a map
position for the trait locus using a `map' statement is required.
markerdrop parameter file - conditional on trait Files for markerdrop may be found in the `Simulation'
subdirectory of
`MORGAN_Examples'.
The sample parameter file, `ped73_mdrop_trait.par', requests simulation of marker data
conditional on a trait model. The trait is assumed to be discrete when
simulation is conditional on a trait model. The relevant section of the file is:
map trait 2 marker 5 distance 5.0 simulate markers 10 using trait map marker positions 10 20 30 40 50 60 70 80 90 100 set incomplete penetrances .05 .6 .95 set trait 2 freqs 0.5 0.5 set markers 1 freqs 0.13 0.66 0.16 0.05 set markers 2 freqs 0.06 0.23 0.41 0.25 0.05 . . . set markers 10 data 101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 201 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 202 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2010 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 302 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 304 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . . |
The first three lines of each of the parameter files above are required
for markerdrop, regardless of simulation of marker data on a trait or on an inheritance
pattern. The `map trait' statement identifies the trait locus to be used in the
simulation and gives its position relative to the markers on which we wish to
simulate data. In this example, the trait locus follows marker 5
at a distance of 5 centiMorgans. The `simulate markers 10 using trait' statement states
the number of markers to be simulated and specifies simulation using a
trait model. The `map marker positions' statement specifies the spacing of
the markers to be simulated.
Compared with the parameter file needed for simulation conditional on inheritance pattern (see next section), `ped73_mdrop_trait.par' contains two additional lines (lines 4 and 5), which are required for discrete traits (the default for simulation on a trait). `set incomplete penetrances' specifies the probability of exhibiting the trait for individuals with trait locus genotypes `1 1', `1 2' (or `2 1') and `2 2', respectively. `set trait ... freqs' specifies trait allele frequencies.
Starting from line 6, the statements specify allele frequencies at the first two markers and are required.
Following the statement `set markers 10 data', patterns of marker data to be simulated are given for each individual. Ten markers correspond to 10 pairs of 0's or 1's, with 0's representing "unobserved" and 1's "observed".
File `ped73_mdrop_trait.par' uses pedigree file `ped73_ph.ped'. The file format section and first few lines of the pedigree data setion of this file are below.
input pedigree size 73 input pedigree record names 3 integers 3 *************************************************** 101 0 0 1 0 0 102 0 0 2 0 0 201 101 102 1 0 0 202 101 102 2 0 0 2010 0 0 2 0 0 301 201 2010 1 0 0 302 201 2010 2 1 1 |
The first three columns are indices of each individual and his/her parents, whereas the last three columns are sex (1=male, 2=female), observed status (0=unobserved, 1=observed) and affectation status (0=missing, 1=unaffected, 2=affected). The affectation status is actually specified through a statement in the parameter file:
input pedigree record trait 2 integer 3 |
This statement can be included in the pedigree file instead. But markerdrop
can simulate data for markers linked to only one trait locus, as specified in the
`map' statement in the parameter file.
markerdrop parameter file - conditional on inheritance pattern The sample parameter file, `ped73_mdrop_inhe.par', requests simulation of marker data conditional on an inheritance pattern. The relevant section of the file is:
map trait 3 marker 4 recomb frac 0.01 simulate markers 10 using inheritance map marker positions 10 20 30 40 50 60 70 80 90 100 set markers 1 freqs 0.13 0.66 0.16 0.05 set markers 2 freqs 0.06 0.23 0.41 0.25 0.05 . . . set markers 10 data 101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 201 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 202 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2010 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 302 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 304 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . . |
Again, the first three lines of each of the parameter files above are required
for markerdrop. The `map trait' statement identifies the trait locus to be used in the
simulation and gives its position relative to the markers on which we wish to
simulate data. In this example, the trait locus
follows marker 4 with a recombination fraction of 0.01.
The `simulate' statement states the number of markers to be simulated
and specifies simulation using the inheritance
pattern. The `map marker positions' statement specifies the spacing of the markers to be simulated.
Lines 4 and 5 specify allele frequencies at the first
two of the ten markers. Statement `set markers 10 data' and lines below
specify the missing data pattern with 0 for missing and 1 for observed.
File `ped73_mdrop_inhe.par' uses pedigree file `ped73_mdrop_inhe.ped'. The file format section and first few lines of the pedigree data setion of this file are below.
input pedigree size 73 input pedigree record names 3 integers 4 input pedigree record trait 3 integer pairs 3 4 *************************************************** 101 0 0 1 0 -1 -1 102 0 0 2 0 -1 -1 201 101 102 1 0 0 1 202 101 102 2 0 1 1 |
The first three columns are indices of individuals and their parents. The next two are sex and observation status. The last two columns are inheritance indicators with the first being the paternal ones and the second the maternal ones. A founder's inheritance indicators are `-1 -1'.
markerdrop
can simulate data for markers linked to only one trait locus, as specified in the
`map' statement in the parameter file.
markerdrop examples and sample output To run the two examples under the subdirectory `Simulation/', one may use the following commands:
markerdrop ped73_mdrop_inhe.par |
and
markerdrop ped73_mdrop_trait.par |
Below is a section of `mdrop_inhe.out', the output file generated using the parameter file `ped73_mdrop_inhe.par' and the pedigree file `ped73_mdrop_inhe.ped'. Similar output would be obtained using `ped73_mdrop_trait.par'.
Inter-locus distances in cM, using Haldane map function:
T3
----------------------------+------------------------------------------
10.0 10.0 10.0 1.0 9.0 10.0 10.0 10.0 10.0 10.0
+------+------+------+-------------+------+------+------+------+------+
M1 M2 M3 M4 M5 M6 M7 M8 M9 M10
assigned FGL in all listed individuals:
trait locus, followed by 10 marker loci
101 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1 2 1
102 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3 4 3
201 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3
202 2 4 2 3 2 3 2 3 2 4 2 4 2 4 2 4 2 4 2 4 2 4
2010 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5 6 5
301 2 6 2 5 2 6 2 6 2 6 2 6 2 5 2 5 2 5 2 5 2 5
302 2 6 3 6 2 6 2 6 2 6 2 5 2 5 2 5 2 6 2 6 2 6
assigned marker genotypes in accordance with data availability:
101 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
201 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
202 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2010 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
301 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
302 4 2 4 3 5 6 3 3 6 1 1 2 3 3 1 4 1 6 1 4
|
In the output file above, the marker map is shown, as specified in the parameter file. Below the map, founder genome labels (FGL) are listed. In this section of the pedigree, individuals 101, 102 and 1020 are founders. They have been assigned two unique FGL each. One of each founder's FGL has been randomly selected to be passed to their offspring. Using the FGL, marker genotypes have been assigned to individuals on whom data were specified as available in the parameter file, individual 302 for example.
markerdrop statements markerdrop computing requests markerdrop requires one of the following two statements:
simulate markers J using traitThis statement requests the simulation of J markers conditional on a trait model. If marker data are simulated conditional on a trait model, the user must specify trait allele frequencies, genotypic penetrances and a map position for the trait locus within the parameter file. Affection status of each individual must be specified in the pedigree file following gender, if present.
simulate markers J using inheritanceThis statement requests the simulation of J markers conditional on an inheritance pattern at the trait locus. If marker data are to be simulated conditional on an inheritance pattern, the user must specify a map position for the trait within the parameter file. In addition, a pair of inheritance indicators for each individual must be included in the pedigree file following gender, if present. The first of the pair describes paternal inheritance at the trait locus and the second describes maternal inheritance. Inheritance indicators are coded as `0', `1' or `-1', corresponding to segregation of the trait allele from the individual's grandmother, grandfather, or unknown, respectively. For example, `0 0' indicates that the individual inherited the alleles carried by both grandmothers at the trait locus, while `0 1' indicates inheritance of the paternal grandmother's and maternal grandfather's alleles.
markerdrop mapping model parameters map [gender (F | M)] marker ( [Kosambi] distances | recombination fractions | [Kosambi] positions) X1 X2 …This statement is required for markerdrop if more than one marker
is to be simulated. It specifies the marker map (optionally a
sex-specific map), in units of genetic distance (cM), marker position
(cM), or recombination fraction. If distance is selected,
markerdrop will expect one fewer values than the number of
markers, as these are intermarker distances. If position is expected,
the same number of values as markers will be expected, as these are the
positions of the markers relative to some zero point to the left of
marker 1. If Kosambi is not specified, the Haldane mapping function is
used to convert between genetic distance and recombination fraction.
map [gender (F | M)] trait K marker J ( [Kosambi] distance | recombination fraction ) XThis statement is required for markerdrop; it tells the program
which trait to use in the simulation of marker data and gives a location
for the trait locus, either as a map distance or recombination fraction,
following the marker listed in the statement. As with genedrop,
to simulate a trait locus position that precedes all markers, list the
marker number as `0'.
markerdrop population model parameters set trait K1 frequencies X1 X2This statement specifies trait allele frequencies.
Trait must be biallelic; both allele frequencies must be
listed and must sum to a value between 0.9999 and 1.0001. Otherwise
markerdrop automatically normalizes the allele frequencies
and issues a warning. Only one
trait may be included in this statement.
set marker names N1 N2 …This statement specifies marker names in the order of their position along the chromosome. Default names are marker-1, marker-2, etc.
set markers J1 … frequencies X1 X2 …Marker allele frequencies are specified using this statement. A marker can have up to 100 alleles and all allele frequencies must be listed. For each marker, the allele frequencies should sum to between 0.9999 and 1.0001. Otherwise they are automatically normalized and a warning message will be issued. Multiple markers can be included in a single statement if they have the same number of alleles with the same frequencies.
markerdrop computational parameters set incomplete penetrancesThis statement is required for markerdrop when using a trait
model or when using inheritance indicators with a discrete trait. A
penetrance, the probability of expressing the trait given a particular
trait locus genotype, must be specified for each of the 3 possible
genotypes at the trait locus. For example `set incomplete
penetrances 0.15 0.85 0.99' specifies that the probability of expressing
the trait is 0.15, 0.85 and 0.99 for (1,1), (1,2) and (2,2) trait locus
genotypes, respectively.
set trait data discreteThis statement is optional. A discrete trait is the default when simulating using a trait model.
As with genedrop, marker seeds and trait seeds can be specified
or the default values can be used, See section genedrop computational parameters.
markerdrop input file options The statements below are optional for markerdrop; they are used
to indicate a change from the default order of trait values in the
pedigree file. The first statement may be included if marker data are
to be simulated conditional on a trait model and the second may be
included if data are to be simulated conditional on an inheritance
pattern.
input pedigree record traits K1 K2 K3 ... integers I1 I2 I3 ...Unless this statement is present, the first integer following gender, if present, is assumed to be data for trait 1, the next integer for trait 2, and so on. Use this statement to specify an alternate correspondence between integer values in the record and trait numbers.
input pedigree record traits K1 K2 ... integer pairs I11 I12 I21 I22 ...Unless this statement is present, the first two integers following gender, if present, in the pedigree file are assumed to be the inheritance indicators at the locus for trait 1. The next two integers are assumed to be the inheritance indicators at the locus for trait 2, and so on.
7.1 Introduction to ibddrop | ||
7.2 Sample ibddrop parameter file | ||
7.3 Running ibddrop example and sample output | ||
7.4 ibddrop statements |
ibddrop ibddrop estimates probabilities of gene identity by descent,
ibd, (such as kinship, inbreeding, or multi-gene identities) by
Monte Carlo in the absence of data. Given the pedigree and a genetic
map, ibddrop simulates meioses indicators and scores them to
estimate the ibd probabilities among a set of gametes.
The simplest example of estimation of ibd probabilities among a set of gametes is the computation of an individual's inbreeding coefficient. In this example, the set of gametes in question are the maternal and paternal gametes that make up the individual. A set of two gametes can be either ibd or not-ibd. To keep track of ibd status among the gametes, we can label the paternal allele `1'. If the two alleles are ibd, the maternal allele would also be labeled `1', and the resulting ibd pattern would be `1 1'. If the two alleles are not ibd, the maternal allele would be labeled `2' and the resulting pattern would be `2 2'. The individual's inbreeding coefficient is the probability that the two alleles follow the `1 1' pattern.
If there are three gametes in the set, there are five potential ibd
patterns: `1 1 1' (all three gametes are ibd), `1 1 2'
(the first two are ibd and the third is not), `1 2 1' (the
first and third are ibd) , `1 2 2' (the last two are ibd),
and `1 2 3' (none are ibd). ibddrop can estimate
probabilities of ibd patterns among up to 10 gametes in a set.
ibddrop outputs a probability for each ibd pattern
at each marker.
Gene identity can be scored either for each locus separately, in which
patterns of identity among up to ten haplotypes can be scored, or it can
be scored jointly over a moving window of several loci. If the moving
window option is selected, genedrop
calculates the probability that all of the gametes in the set are
ibd at each locus in the window. As a result, it is then possible
to determine the probability that all or some of the gametes are ibd for a
particular haplotype.
ibddrop parameter file Files for ibddrop may be found in the `IBD' subdirectory of
`MORGAN_Examples'.
There are two sample parameter files for ibddrop; let's take a look at
`jv_rep_ibd.par' first.
input pedigree file "jv_rep.ped" simulate markers 5 trait 1 map markers distances 44.6 44.6 11.2 11.2 map trait 1 marker 2 distances 22.3 set component 1 proband gametes 331 0 333 1 set component 2 proband gametes 541 0 541 1 341 0 343 1 input seed file '../sampler.seed' set MC iterations 20000 |
The parameter file specifies the pedigree file name `jv_rep.ped' and then asks for five markers and one trait locus. Since there are no data, the distinction between marker and trait doesn't mean anything - it is just a way to specify a set of loci, one of which may be unlinked. `jv_rep.ped' contains data on 30 individuals, including gender and one trait.
The two `map' statements specify the genetic map. From the first statement, the genetic distances between the markers are 44.6, 44.6, 11.2 and 11.2 centiMorgans. From the second statement, the trait lies between markers 2 and 3, at 22.3 centiMorgans with marker 2.
The `set proband gametes' statements tell ibddrop which
gametes to score. Here we selected, from component 1 (the first family
in the data set), the maternal (0) gamete of `331' and the paternal (1)
gamete of `333'. The next statement selected four gametes to score
from family 2. Note that characters are allowed in the names of individuals.
The `input seed file' statement enables the file to use the seeds from
file `sampler.seed'. When the program finishes running, new seeds will
be generated and appended at the end of this file. Seeds can also be set
using the `set sampler seeds' statement (see ibddrop statements).
The number of Monte Carlo iterations is set to be 20,000 by the `set MC iterations' statement.
Note that if one would like to compute multilocus ibd probability, statement `set locus window' can be used to specify number of loci to score simultaneously.
ibddrop example and sample output Under the subdirectory `IBD/', run the example with the command:
ibddrop jv_rep_ibd.par |
The genetic map specified by the statements `map markers distances' and `map trait 1 marker 2 distances' is below. Note the position of the trait locus (T1) with respect to the marker loci.
Distances (cM):
T1
--------------+---------------------
44.6 22.3 22.3 11.2 11.2
+------+-------------+------+------+
M1 M2 M3 M4 M5
|
Since the parameter file contains two `set proband gametes'
statements, ibddrop will produce two sets of results. The
results for the second component are detailed below.
Summary for component 2:
Probabilities of IBD patterns
Proband gamete set 1: 541 0 541 1 341 0 343 1
pattern marker-1 marker-2 trait-1 marker-3 marker-4 marker-5 label
1 1 1 1 .0322 .0312 .0289 .0301 .0294 .0306 0
1 1 1 2 .0291 .0294 .0295 .0284 .0299 .0295 1
1 1 2 1 .0129 .0143 .0147 .0140 .0142 .0134 3
1 1 2 2 .0103 .0098 .0097 .0097 .0089 .0092 4
1 1 2 3 .0273 .0280 .0263 .0269 .0276 .0267 5
1 2 1 1 .0654 .0638 .0634 .0652 .0633 .0649 6
1 2 1 2 .0060 .0063 .0057 .0062 .0062 .0058 7
1 2 1 3 .0571 .0576 .0581 .0601 .0602 .0584 8
1 2 2 1 .0692 .0680 .0661 .0685 .0703 .0728 9
1 2 2 2 .0483 .0527 .0488 .0493 .0486 .0485 10
1 2 2 3 .1389 .1405 .1386 .1384 .1371 .1326 11
1 2 3 1 .1348 .1366 .1354 .1379 .1369 .1391 12
1 2 3 2 .0267 .0250 .0259 .0266 .0267 .0285 13
1 2 3 3 .0949 .0945 .0980 .0949 .0968 .0975 14
1 2 3 4 .2469 .2424 .2508 .2437 .2442 .2427 15
|
The probabilities are summarized by the ibd pattern. Each integer
in the pattern represents one of the gametes that ibddrop was
asked to score. Same numbers indicate gametes that are ibd. For
instance, `1 1 1 1' means all four gametes are ibd; `1 2 1
1' means gametes 1, 3, and 4 are ibd, while gamete 2 is not ibd
with the others; `1 2 3 4' means all four gametes are not ibd.
The ibd patterns are scored for each locus separately; there is a column for each of the five markers and one for the trait.
To compute multilocus ibd probabilities, say, 3 loci, use `set locus window 3' in the parameter file and re-run the example using the same command line. And the interesting part of the output is
Summary for component 2:
Probabilities of IBD patterns for windows of 3 loci
Proband gamete set 1: 541 0 541 1
IBD wndw 1 wndw 2 wndw 3 wndw 4
0 0 0 .7291 .7443 .7657 .7881
0 0 1 .0698 .0655 .0482 .0478
0 1 0 .0640 .0532 .0365 .0266
0 1 1 .0279 .0252 .0369 .0284
1 0 0 .0806 .0696 .0703 .0493
1 0 1 .0087 .0080 .0067 .0049
1 1 0 .0135 .0238 .0177 .0268
1 1 1 .0063 .0105 .0180 .0281
|
This time, ibddrop was asked to compute ibd probabilities in
windows of three loci at a time. This was done using the `set
locus window' statement. Since the trait locus is unlinked to the
marker loci in this example, it is placed to the left of the five marker
loci on the map. Thus, the first window, `wndw 1' in the table
above, includes the trait locus and the first two marker loci,
`wndw 2' includes the first three marker loci, `wndw 3'
includes marker loci 2, 3 and 4, etc. The values in the `ibd'
column at the left of the table, represent `ibd' patterns. The
pattern `0 0 0' means that the selected gametes are not ibd at
the three loci in each window. The pattern `0 0 1' means that the
selected gametes are not ibd at the first two loci in the window,
but are ibd at the third. The values in the columns give the
probability of the ibd pattern at the left for each of the four
windows. For example, the probability that the maternal and paternal
gametes of individual 541 are ibd at marker loci 3 and 5, but not at
marker locus 4 is 0.0049.
ibddrop statements genedrop computing requests).
genedrop mapping model parameters).
set [component M] proband gametes N1 K1 N2 K2...In this statement, the user specifies which gametes ibddrop is to
score. Each statement must contain gametes from a single component, as
the components are assumed to be independent, i.e., the probability of
ibd between gametes from different components is zero. Pairs
consisting of an individual's name and a meiosis indicator are listed,
with `0' indicating the individual's maternal gamete and `1'
indicating their paternal gamete.
In the current version of MORGAN, the number of proband gametes in a set is limited to 10.
set [chromosome I] locus window KThis statement gives the window size (number of loci) for which the multilocus ibd probabilities are scored. If no size is given, each locus is scored separately.
set sampler seeds H1 H2This statement initializes a pair of seeds for the random number generator. The seeds must be positive and no greater than `0xFFFFFFFF', with the first seed (congruential seed) odd, and the second seed (Tausworthe seed) nonzero. If no seeds are specified, default seeds are used.
set MC iterations IRequired. This statement specifies the total number of Monte Carlo iterations.
simulate markers K1 trait 1This statement specifies the number of markers to be simulated, as well as a linked trait. A linked or unlinked trait must be specified.
simulate unlinked trait 1This statement specifies a trait to be simulated that is not linked to markers. Only one trait can be simulated and this trait will be placed to the left of all markers.
| 8.1 Genetic model | ||
| 8.2 LM-sampler | ||
| 8.3 MCMC parameters and options |
For MORGAN programs, genetic relationships between individuals
in a data set are specified in the pedigree file. Individuals at the top
of a pedigree (family), whose parents are unspecified, are the
founders of the pedigree; other individuals are non-founders.
By definition, founders are unrelated to one another. Descent through
the pedigree of genes at marker and trait loci is tracked by meiosis
indicators, also called inheritance indicators. At each locus,
non-founders are assigned two 0/1 meiosis indicators,
representing genes inherited from the individual's father and mother.
The first indicator is coded as `0' if the non-founder inherited
the gene carried by her father's mother and `1' if she inherited
the gene carried by her father's father, i.e., her paternal grandmother
and grandfather, respectively. The second indicator is coded as
`0' if the non-founder inherited the gene carried by her mother's
mother and `1' if she inherited the gene carried by her mother's
father, i.e., her maternal grandmother and grandfather, respectively.
We use the term gene to refer to a segment of DNA that is copied
from parents to offspring, the concept captured by Mendel's term
factor.
The set of all inheritance indicators is notated as S =
, where
| 0 if DNA involved in meiosis i at locus j is the gamete's parent's maternal DNA |
1 if DNA involved in meiosis i at locus j is the gamete's parent's paternal DNA |
is referred to as the inheritance vector at
locus j. It is the list of the inheritance indicators for all
meioses in the sample at a single locus (locus j). The elements of
are independent of one another, as they represent
the grandmaternal or grandpaternal inheritance for each gamete at the
locus and meiosis events are independent of one another.
is the list of the inheritance indicators at all loci for a
single meiosis (i.e., within a single gamete). The elements of
have first-order Markov dependence. That is, the
probability of a `0' or `1' at locus j + 1 is a function
of the the value at locus j and the recombination fraction between
the loci. In other words, the value at locus j + 1 is conditionally
independent of all other loci, given the value at locus j.
If inheritance indicators are known, identity by descent (ibd) is also known. If probabilities can be assigned to patterns of inheritance indicators in a pedigree, the probability that any two gametes in the pedigree are ibd can be computed.
MORGAN's Autozyg and Lodscore programs use MCMC to estimate ibd probabilities and multilocus LOD scores, respectively, in pedigrees. The latent (unobserved) parameters of interest in MCMC estimation of ibd probabilities and LOD scores are the meiosis indicators at marker and/or trait loci for each non-founder in the pedigree. Observed data are trait values and unphased marker genotypes for some or all pedigree members. With unphased genotypes, it may or may not be possible to determine the grandparental source (i.e., the inheritance indicator) of each allele unambiguously. MORGAN uses MCMC to sample meiosis indicators (S) conditional on observed data (Y).
MORGAN implements two different block Gibbs samplers, a locus-
and a meiosis-sampler, for sampling from S conditional on Y.
Each method updates a subset,
, of S conditional on Y
and on the rest of S (
). The difference between the two
methods is the choice of
.
The locus-sampler (or L-sampler) chooses
to be
for some j. In other words, a single locus is
selected and inheritance indicators at that locus are updated based on
the genotype data at all loci and on the current realization of inheritance
indicators at all loci other than j. The MORGAN user can determine
whether a locus is to be selected at random each time or if loci are
taken in a pre-determined random order, as described in the next
section. This method is a modification of the Elston-Stewart algorithm
(Elston RC and Stewart J (1971) Human Heredity 21:523-542) and it can be used whenever single locus pedigree peeling is
possible. If inter-locus recombination fractions are strictly positive,
the L-sampler is irreducible. On the downside, mixing is poor if loci
are tightly linked.
The meiosis-sampler (or M-sampler) chooses
to be
for some i. It is , in a sense, perpendicular to the
L-sampler in that at each iteration a single meiosis is selected and
inheritance indicators for that meiosis are updated conditional on the
genotype data at all loci and the current realization of inheritance
indicators for all other meioses. The M-sampler is a modification of
the Lander-Green algorithm (Lander ES and Green P (1987) PNAS 84:2363-2367) for peeling along a chromosome
using the Baum algorithm (Baum LE (1972) in O. Shea, ed.,
`Inequalities-III; Proceedings of the Third Symposium on Inequalities,
UCLA, 1969', Academic Press, NY pp. 1-8). At each iteration, a single
meiosis is randomly selected or meioses can be updated sequentially. As
with locus selection in the L-sampler, MORGAN allows the user
to choose the meiosis selection The M-sampler mixes well in the presence
of tightly linked loci, but it can perform poorly in large pedigrees
with missing data.
MORGAN's Autozyg and Lodscore programs use a combination of the L- and M-samplers, referred to as the LM-sampler. The user may choose the fraction of updates that are of each type; the default is 20% L-sampler, 80% M-sampler.
For mathematical details on the L-, M- and LM-samplers