
12. Population-based inference of IBD

12.1 Introduction to ibd_haplo  
12.2 Sample parameter files for ibd_haplo  
12.3 Running ibd_haplo examples and sample output  
12.4 Population-based IBD inference parameter statements  

See Concept Index for: population-based IBD inference.



12.1 Introduction to ibd_haplo

See References, for details of the cited papers.

The program ibd_haplo computes conditional probabilities of gene IBD (identity by descent) states, given data for marker loci for a set of proband gametes. Each set is normally of two individuals (4 haplotypes). Proband gametes are specified haplotypes (maternal or paternal) of specified individuals. These are specified as for the lm_auto program: See Sample lm_auto parameter file. Each individual named in the marker data has to have a unique name. There may be missing data. The marker data are read in as genotypes, but these may be analyzed as an ordered or unordered pair of alleles (i.e. phased or unphased). There is also an option for partial phasing, in which segments of chromosome (sets of contiguous markers) are specified as phased.

The program uses a hidden Markov model (HMM) for the latent IBD states. For two individuals there are 15 such states, although only 9 are distinguishable from unphased genotypes. There are two options for the transition matrix of the HMM latent IBD state, one developed by Thompson [Tho08b], and the other by Chaozhi Zheng [BGZT12]. Given the latent state, the locus-specific genotype probabilities are based on the premise that IBD DNA should be of the same allelic type, and that non-IBD DNA is of independent allelic types, although allowance is made for typing error to eliminate zero emission probabilities. The transition matrices are also modified to eliminate zero transition probabilities. A simple forward-backward HMM computation provides the probabilities for each IBD state at each locus for each set of proband gametes.
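
The forward-backward computation mentioned above can be sketched generically as follows. This is a minimal illustration for an arbitrary discrete HMM, not the ibd_haplo implementation: the function name `forward_backward', the two-state toy model, and every numerical value are assumptions made for the example. The strictly positive transition and emission entries mirror the remark that zero probabilities are eliminated.

```python
def forward_backward(prior, transitions, emissions):
    """Posterior state probabilities at each locus of a discrete HMM.

    prior:       K initial state probabilities
    transitions: T-1 matrices; transitions[t][i][j] = P(state j at t+1 | state i at t)
    emissions:   T rows; emissions[t][k] = P(data at locus t | state k)
    """
    n_loci, n_states = len(emissions), len(prior)
    states = range(n_states)

    # Forward pass, normalised at each locus to avoid numerical underflow.
    alpha = []
    for t in range(n_loci):
        if t == 0:
            raw = [prior[k] * emissions[0][k] for k in states]
        else:
            prev = alpha[-1]
            raw = [emissions[t][j] * sum(prev[i] * transitions[t - 1][i][j] for i in states)
                   for j in states]
        total = sum(raw)
        alpha.append([x / total for x in raw])

    # Backward pass, normalised in the same way.
    beta = [[1.0] * n_states for _ in range(n_loci)]
    for t in range(n_loci - 2, -1, -1):
        raw = [sum(transitions[t][i][j] * emissions[t + 1][j] * beta[t + 1][j] for j in states)
               for i in states]
        total = sum(raw)
        beta[t] = [x / total for x in raw]

    # Combine the two passes and renormalise per locus.
    posteriors = []
    for t in range(n_loci):
        raw = [alpha[t][k] * beta[t][k] for k in states]
        total = sum(raw)
        posteriors.append([x / total for x in raw])
    return posteriors


# Toy model (hypothetical values): state 0 = not IBD, state 1 = IBD, three loci.
step = [[0.9, 0.1], [0.1, 0.9]]
toy = forward_backward(
    prior=[0.95, 0.05],
    transitions=[step, step],
    emissions=[[0.5, 0.5], [0.01, 0.99], [0.5, 0.5]],
)
```

In the toy model only the middle locus is informative, yet the posterior IBD probability at the flanking loci is also raised above the prior, because the transition matrix links neighbouring loci.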

The methods and study results of this approach are provided in [Tho08b] and [BGZT12]. Note that the data files and software released for [BGZT12] are for an earlier version of ibd_haplo. The version described here includes improvements both in the user interface and in the way the IBD transitions are implemented. This version was first released for MORGAN V3.1.1.

See Concept Index for: ibd_haplo introduction



12.2 Sample parameter files for ibd_haplo

Three sample parameter files for ibd_haplo can be found in the directory `MORGAN_Examples/Haplo'. All three examples are based on examples in the Gold standards for the program. The first two examples (`phased_2011.par' and `unphased_2011.par') use the same data, and score the same sets of 4 gametes, consisting of the maternal and paternal gametes in pairs of individuals. The examples differ only in whether the data are treated as phased haplotypes or unphased genotypes.

Here is the `unphased_2011.par' parameter file:

 
set printlevel 3
input marker data file "./marker_data"
output overwrite scores file "./qibd_unphased_2011.out"
output overwrite extra file "./ids_unphased_2011.out"
select all markers

select 2011 state transition matrix
select unphased data
set population kinship              0.05
set kinship change rate             0.05
set transition matrix null fraction 0.1
set genotyping error rate 0.01

set scoreset 1  proband gametes  1400003 0 1400003 1 1400009 0 1400009 1
set scoreset 2  proband gametes  1400005 0 1400005 1 1400007 0 1400007 1
set scoreset 3  proband gametes  1400011 0 1400011 1 1400013 0 1400013 1
set scoreset 4  proband gametes  1400015 0 1400015 1 1400017 0 1400017 1

Since ibd_haplo typically runs with a very large number of markers, it is advisable to suppress printing of the marker map and allele frequencies using the `printlevel' setting. The file specifications, and marker data, are as for previous programs such as lm_auto: See Sample lm_auto parameter file. Note that there are two output files.

The marker data file `marker_data' contains the positions and allele frequencies of 2000 markers and marker genotypes of 8 individuals. In this example all markers are selected.

The second group of statements relates to the ibd_haplo implementation. The `2011' transition matrix is to be used; this is the one described in [BGZT12] and is recommended. The earlier `2009' option of [Tho08b] is retained for backwards compatibility. The data are to be analyzed as unphased genotypes, and there are four numerical parameters of the HMM. Most importantly, these include the `population kinship', which is the mean a priori level of pairwise IBD between any pair of gametes.

The final set of statements specifies four sets of four gametes among which IBD is to be scored. Since the data are to be analyzed as unphased it is required that each set contains both the maternal and paternal gametes of individuals. In this example, the gametes are those of consecutive pairs of individuals in the marker data file, but this is not required.

The second example parameter file `phased_2011.par' differs only in the names of the output files and in the statement `select phased data', which specifies that the data should be treated as phased haplotypes. In this case it is not necessary that a scoreset consists of both gametes of individuals.
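
The requirement that unphased scoresets contain both gametes of each individual can be checked before running the program. The sketch below is only an illustration: the helper names `parse_scoreset' and `has_whole_individuals' are hypothetical and are not part of MORGAN, and the parser assumes the numbered `set scoreset N proband gametes ...' form used in the examples.

```python
from collections import defaultdict


def parse_scoreset(statement):
    """Split 'set scoreset N proband gametes ID K ID K ...' into (N, [(ID, K), ...])."""
    tokens = statement.split()
    if tokens[:2] != ["set", "scoreset"] or tokens[3:5] != ["proband", "gametes"]:
        raise ValueError("not a numbered scoreset statement: " + statement)
    number = int(tokens[2])
    fields = tokens[5:]
    if len(fields) % 2 != 0:
        raise ValueError("gametes must be given as name/indicator pairs")
    gametes = []
    for i in range(0, len(fields), 2):
        indicator = int(fields[i + 1])
        if indicator not in (0, 1):
            raise ValueError("indicator must be 0 (maternal) or 1 (paternal)")
        gametes.append((fields[i], indicator))
    return number, gametes


def has_whole_individuals(gametes):
    """True if every individual in the set contributes both gamete 0 and gamete 1."""
    seen = defaultdict(set)
    for name, indicator in gametes:
        seen[name].add(indicator)
    return all(indicators == {0, 1} for indicators in seen.values())


number, gametes = parse_scoreset(
    "set scoreset 1  proband gametes  1400003 0 1400003 1 1400009 0 1400009 1"
)
usable_as_unphased = has_whole_individuals(gametes)
```

A set for which `has_whole_individuals' returns False would only be suitable for an analysis with `select phased data'.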

The third example `ten_ss.par' shows the flexibility of scoresets. This example differs in that only a subset of markers is used. The `select all markers' statement is replaced by:
 
select markers 511 512 513 514 515 516 517 518 519 520
               521 522 523 524 525 526 527 528 529 530
               531 532 533 534 535 536 537 538 539 540
Additionally, the scoresets are quite varied:
 
set scoreset 61  proband gametes  1400015 0 1400015 1 1400017 0 1400017 1 1400003 0 1400003 1
set scoreset 41  proband gametes  1400003 0 1400003 1 1400009 0 1400009 1
set scoreset 43  proband gametes  1400005 0 1400005 1 1400007 0 1400007 1
set scoreset 51  proband gametes  1400003 0 1400003 1 1400009 0 1400009 1 1400005 0
set scoreset 32  proband gametes  1400003 0 1400003 1 1400009 0
set scoreset 21  proband gametes  1400017 0 1400017 1
set scoreset 42  proband gametes  1400015 0 1400015 1 1400017 0 1400017 1
set scoreset 52  proband gametes  1400003 1 1400005 1 1400007 1 1400009 1 1400011 1
set scoreset 44  proband gametes  1400011 1 1400011 0 1400013 0 1400013 1
set scoreset 31  proband gametes  1400007 0 1400009 1 1400009 0

The scoresets may have arbitrary numerical indicators, and range in size from 2 to 6 gametes. The program will reorder them according to size.



12.3 Running ibd_haplo examples and sample output

Run the examples in the `Haplo' subdirectory of the `MORGAN_Examples' directory with one of the following commands:

 
./ibd_haplo  unphased_2011.par > unphased_2011.out
or
./ibd_haplo  phased_2011.par > phased_2011.out
or
./ibd_haplo  ten_ss.par > ten_ss.out

Each example produces two output files in addition to the standard output. The standard output gives little information when `printlevel 3' is used. The proband gamete sets are specified, and the program reports as it analyzes each set. In between, the program gives the prior probability of IBD states for each size of scoreset requested, and the transition matrix at a distance of 1 centiMorgan; these are mainly for checking purposes.

The two main output files are the `qibd' file and the `ids' file. For the first example, these have been named as `qibd_unphased_2011.out' and `ids_unphased_2011.out' with analogous names for the other examples. (Of course, any names for these files can be specified in the parameter file.) Each file contains only numeric data, so that it can be read easily into R or other programs for further analysis. The `qibd' file is the key output of probabilities of IBD states in each scoreset at each marker, computed conditional on marker data. The `ids' file gives the scoresets.

For the second example the `ids' output file is
 
 1  4 15  1400003 0  1400003 1  1400009 0  1400009 1
 2  4 15  1400005 0  1400005 1  1400007 0  1400007 1
 3  4 15  1400011 0  1400011 1  1400013 0  1400013 1
 4  4 15  1400015 0  1400015 1  1400017 0  1400017 1
That is, there are four scoresets numbered 1 to 4. Each consists of 4 gametes. Since the data are analyzed as phased, there are 15 IBD states for each scoreset, and these gametes are specified in the usual format: `0' for a maternal gamete and `1' for a paternal, of the individual whose name ID is given.
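
Because the `ids' file is purely numeric, each line can be unpacked with a few lines of code. The sketch below assumes the column layout read off the sample above (scoreset number, number of gametes, number of IBD states, then name and indicator pairs); the function name `parse_ids_line' and the dictionary keys are hypothetical.

```python
def parse_ids_line(line):
    """Unpack one line of an `ids' file into a small dictionary."""
    tokens = line.split()
    scoreset, n_gametes, n_states = (int(t) for t in tokens[:3])
    fields = tokens[3:]
    if len(fields) != 2 * n_gametes:
        raise ValueError("expected %d name/indicator pairs" % n_gametes)
    parent = {"0": "maternal", "1": "paternal"}
    # Each gamete is an individual's name followed by 0 (maternal) or 1 (paternal).
    gametes = [(fields[i], parent[fields[i + 1]]) for i in range(0, len(fields), 2)]
    return {"scoreset": scoreset, "n_states": n_states, "gametes": gametes}


record = parse_ids_line(" 1  4 15  1400003 0  1400003 1  1400009 0  1400009 1")
```

An indicator other than 0 or 1 raises a KeyError in this sketch, which is a deliberate way of flagging a malformed line.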

The corresponding `qibd' output file consists of 8000 lines, each of two integers, followed by 16 real numbers. The first line starts
 
 1     1   14.879888   0.0006  0.0756  0.0225  0.0011  0.8072  0.0001 ....
while line 3108 starts
 
 2  1108   32.534635   0.0000  0.0000  0.0000  0.0000  0.0001  0.0000  ...
The first item indicates the scoreset, the second the marker number, and the third the centiMorgan (or Mbp) position of the marker. The remaining 15 numbers are the probabilities of the 15 IBD states. In the first line, most of the probability (0.8072) is in state 5, which is the state `1123'. That is, the first individual's two gametes are likely to be IBD at this first locus, but there is no IBD sharing between the individuals. In the second line shown (line 3108), there is a very small probability (0.0001) in this same state, and almost all the probability is in the no-IBD state (state 15; `1234'; probability 0.9998, not shown). For sets of 4 gametes, we use the traditional ordering of the 15 IBD states or 9 reduced genotypic states:
 
The order of the 15 states is 1111, 1122, 1112, 1121, 1123, 1211, 1222,
1233, 1212, 1221, 1213, 1231, 1223, 1232, 1234.

For the nine reduced states, the order is the same,
but genotypically equivalent ones are
combined:  1111, 1122, 1112+1121, 1123, 1211+1222,
1233, 1212+1221, 1213+1231+1223+1232, 1234.
For more general gamete sets, the ordering is lexicographic.
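
The relation between the two orderings can be made concrete by collapsing a vector of 15 state probabilities into the 9 reduced states. This is a small sketch using the orderings listed above; the names `STATES_15', `REDUCED_9' and `reduce_to_nine' are hypothetical, and the uniform input vector is only a placeholder.

```python
# The 15 IBD states for 4 gametes, in the order listed above.
STATES_15 = ["1111", "1122", "1112", "1121", "1123", "1211", "1222", "1233",
             "1212", "1221", "1213", "1231", "1223", "1232", "1234"]

# The 9 reduced states, each a group of genotypically equivalent states.
REDUCED_9 = [["1111"], ["1122"], ["1112", "1121"], ["1123"], ["1211", "1222"],
             ["1233"], ["1212", "1221"], ["1213", "1231", "1223", "1232"], ["1234"]]


def reduce_to_nine(probs15):
    """Sum the probabilities of genotypically equivalent states."""
    if len(probs15) != len(STATES_15):
        raise ValueError("expected 15 probabilities")
    by_label = dict(zip(STATES_15, probs15))
    return [sum(by_label[label] for label in group) for group in REDUCED_9]


# Placeholder input: equal probability on each of the 15 states.
nine = reduce_to_nine([1.0 / 15] * 15)
```

With the uniform placeholder, the reduced probabilities are proportional to the group sizes 1, 1, 2, 1, 2, 1, 2, 4, 1.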

For more on the specification of IBD states see Sample lm_auto parameter file.
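
The `qibd' line layout described above (scoreset, marker number, position, then one probability per IBD state) can be read with a short script. In this sketch all function names are hypothetical, the input line is made up for illustration rather than taken from real output, and the line-number helper assumes that scoresets are written in order with one line per marker, as in the 8000-line example.

```python
STATES_15 = ["1111", "1122", "1112", "1121", "1123", "1211", "1222", "1233",
             "1212", "1221", "1213", "1231", "1223", "1232", "1234"]


def parse_qibd_line(line):
    """Return (scoreset, marker, position, probabilities) from one `qibd' line."""
    tokens = line.split()
    scoreset, marker = int(tokens[0]), int(tokens[1])
    position = float(tokens[2])
    probs = [float(t) for t in tokens[3:]]
    return scoreset, marker, position, probs


def most_probable_state(probs, labels=STATES_15):
    """Label and probability of the state with the largest probability."""
    index = max(range(len(probs)), key=probs.__getitem__)
    return labels[index], probs[index]


def line_number(scoreset_index, marker, n_markers):
    """1-based line of a (scoreset, marker) pair, assuming one line per marker."""
    return (scoreset_index - 1) * n_markers + marker


# Made-up line: probability 0.8 on state 5 (`1123') and 0.2 on state 15 (`1234').
made_up = [0.0] * 15
made_up[4], made_up[14] = 0.8, 0.2
line = "1 7 0.5 " + " ".join("%.4f" % p for p in made_up)
scoreset, marker, position, probs = parse_qibd_line(line)
label, prob = most_probable_state(probs)
```

For the second example, `line_number(2, 1108, 2000)' gives 3108, which agrees with the line quoted above for scoreset 2 and marker 1108.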



12.4 Population-based IBD inference parameter statements

The following statements are specific to ibd_haplo, or have a particular role in this program:

select ([partially] phased, unphased) data

The "select ... data" statement is used to inform "ibd_haplo" whether to handle the data as phased data, unphased data or partially phased data. If the data are phased there is no restriction on whether proband gametes are related to each other or not. If the data are unphased or partially phased it is necessary that the proband gametes are pairs of haplotypes, each pair belonging to a whole individual.

select [2009 | 2011] state transition matrix

There are two different state transition matrices implemented in the ibd_haplo program. The user must specify which transition matrix to use for the analysis: see Sample parameter files for ibd_haplo.

set transition matrix null fraction X

This statement sets a parameter that modifies the transition matrix to allow for transitions that cannot occur under the base transition matrices. The argument, X, is a real number greater than or equal to 0.0 and less than or equal to 1.0.

set genotyping error rate E

This statement sets the genotyping error rate to be used by "ibd_haplo". The value of E is a real number greater than or equal to 0.0 and less than 1.0. The value 0.01 would be a typical value for E.

set population kinship X

This statement sets the prior population kinship parameter to be used by "ibd_haplo" to X, where X is a real number greater than 0.0 and less than 1.0. Typically in small populations a value from 0.01 to 0.05 might be reasonable.

set kinship change rate X

This statement sets the kinship change rate parameter for IBD. This is the total change rate per centiMorgan. It should be a real number greater than 0.0. It is approximately the prior for the inverse of an IBD segment length in centiMorgans between any pair of haplotypes. However, a value smaller than the one implied by the typical expected segment length generally works better.

set [scoreset N] proband gametes N1 K1 N2 K2 ...

One or more scoring sets may be given, where a scoring set consists of two or more haplotypes. If there is more than one set, each set is assigned a number 1 or greater. The maximum number of haplotypes in each set is limited to 10, due to computer memory considerations. Pairs of names and meiosis indicators are given, with 0 indicating maternal inheritance and 1 indicating paternal inheritance. At least one proband gametes scoreset must be specified when running ibd_haplo.
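
The growth in the number of IBD states with set size gives some intuition for the limit of 10 haplotypes. The IBD states among n gametes correspond to the partitions of the n gametes into IBD classes, so their number is the Bell number B(n), which is 15 for n = 4 as stated earlier. The sketch below computes these counts with the Bell triangle; the function name `bell_numbers' is hypothetical, and how ibd_haplo actually stores its states is not described here.

```python
def bell_numbers(n_max):
    """Return [B(0), ..., B(n_max)], the numbers of set partitions, via the Bell triangle."""
    bells = [1]
    row = [1]
    for _ in range(n_max):
        # Each new row starts with the last entry of the previous row;
        # each later entry adds the entry to its left and the one above that.
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
        bells.append(row[0])
    return bells


counts = bell_numbers(10)
for n_gametes in range(2, 11):
    print(n_gametes, counts[n_gametes])
```

The count is 15 for 4 gametes, 203 for 6 gametes, and 115975 for 10 gametes.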



This document was generated by Elizabeth Thompson on July, 7 2013 using texi2html