8. Using MCMC to Estimate Parameters of Interest in Pedigree Data

See Concept Index for: MCMC introduction, Markov chain Monte Carlo.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

8.1 Specifying inheritance

See References, for details of the cited papers.

For MORGAN programs, genetic relationships between individuals in a data set are specified in the pedigree file. Individuals at the top of a pedigree (family), whose parents are unspecified, are the founders of the pedigree; other individuals are non-founders. In pedigrees, identity by descent is defined relative to the founders of the pedigree, so that, by definition, founders are unrelated to one another. Descent through the pedigree of genes at marker and trait loci is tracked by meiosis indicators, also known as inheritance indicators or segregation indicators [Don83]. At each locus, non-founders are assigned two 0/1 meiosis indicators, representing genes inherited from the individual’s father and mother. The first indicator is coded as ‘0’ if the non-founder inherited the gene carried by her father’s mother and ‘1’ if she inherited the gene carried by her father’s father, i.e. her paternal grandmother and grandfather, respectively. The second indicator is coded as ‘0’ if the non-founder inherited the gene carried by her mother’s mother and ‘1’ if she inherited the gene carried by her mother’s father, i.e., her maternal grandmother and grandfather, respectively. We use the term gene to refer to a segment of DNA that is copied from parents to offspring, the concept captured by Mendel’s term factor.

The set of all meiosis indicators is denoted S = ( Sij ) where

Sij =	0 if DNA involved in meiosis i at locus j is the gamete’s parent’s maternal DNA
	1 if DNA involved in meiosis i at locus j is the gamete’s parent’s paternal DNA

The vector of meiosis indicators at a single locus j over all the meioses of a pedigree is known as the inheritance vector at locus j [LG87] and is denoted S.j. The elements of S.j are independent of one another, as they represent the inheritance to gametes resulting from different (and hence independent) meioses. Si. is the vector of meiosis indicators at all loci for a single meiosis i (that is, in the formation of a single gamete). Assuming the absence of genetic interference [Hal19], the elements of Si. have first–order Markov dependence. That is, the value ‘0’ or ‘1’ at locus j + 1, given the values at loci 1, 2, ...., j depends only on the value at locus j. Specifically, this probability is a function of the value at locus j and the recombination fraction between the loci j and j+1.

If meiosis indicators are known, identity by descent (ibd) is also known. If probabilities can be assigned to patterns of meiosis indicators in a pedigree, the probability that any set of gametes in the pedigree are ibd can in principle be computed.

See Concept Index for: specifying inheritance, meiosis indicators, founder genome labels, inheritance vector, ibd.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

8.2 Genetic model

See References, for details of the cited papers.

In MORGAN there are three basic genetic data types. These are: be genotypic, typically used for marker data; discrete, a data type for binary data requiring specification of incomplete penetrances; and quantitative, using a Gaussian penetrance with specification of genotypic means and residual variance.

As yet, loci are either multiallelic marker loci assumed observed without error, or trait loci which may have general penetrance functions but must have two alleles. Gradually, available models are being generalized:

Pedigree peeling for multiallelic loci with general penetrance;
In order to allow models for “non-genotypic” markers, general joint peeling programs have been implemented, based on Thompson in [Tho77]. For zero-loop pedigrees (see [Tho76]), these peeling routines are used by the lm_map program which allows for errors in marker data. For general pedigrees, they are not yet released, as they are still in process of testing.
Penetrance functions and trait models:
Liability classes been implemented for the discrete-trait penetrance model. A parameter statement specifies the number of liability classes and the 3 genotype-specific penetrances in each class. Additionally, an age-based penetrance function for a qualitative trait has been implemented, with age specified as a covariate for the discrete trait. The mean ages of onset for each genotype are provides via the genotypic means parameter statement, and standard deviations are separately provides. An additional parameter statement specifies a window-width for the age-of-onset data.
Traits and trait loci:
The program lm_twoqtl allows two (linked or unlinked) quantitative trait loci to contribute additively or epistatically to a single trait [STW07]. A polygenic component may also be also be included. Two-locus penetrances may be specified as additive, with a genotypic mean for each trait genotype for each locus. Alternatively, a matrix array of 2-locus genotypic means may be specified, allowing for epistasis [SW07].

With more these more complex trait models, including those of lm_twoqtl [STW07], a more general specification of traits is required. From MORGAN V3.0, completely new structures have been introduced, separating traits (phenotypes) from trait loci (“tlocs”). Traits may be affected by genotypes at several tlocs; the genotypes at a tloc may affect several traits.

See Concept Index for: genetic model, penetrance.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

8.3 Exact HMM computations

Using the inheritance vectors or meiosis indicators, the structure of the problem is that of a hidden Markov model (HMM) with the Markov latent state being the S.j, Markov over markers j. When the pedigree is small, so that each S.j takes only a practical number of values, standard exact HMM computational methods apply. Likelihoods and lod scores can be computed exactly. Alternatively, a single forwards computation followed by (repeated) backwards sampling provides multiple independent realizations from the joint distribution of all the Sij given the marker data (or given the marker and trait data, if the latter is included in the set of loci j).

Note that in fact Sij are independent over meioses i, so that the structure is that of a factored HMM. Forward HMM computation for multiple meioses has been replaced by a factored version (FHMM), enabling much faster exact computation on small pedigree components and multiple-meiosis sampling for larger numbers of meioses.

Exact computation of lodscores on small pedigree components has been implemented for lm_linkage using the FHMM version of the Baum algorithm. Additionally, these FHMM computations are also a component of MCMC sampling on larger pedigree components (see next section).

Gold standards for exact computation are added in the Lodscore/Gold2 subdirectory.

See Concept Index for: exact computations, HMM computations.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

8.4 Single and multiple meiosis LM-samplers

See References, for details of the cited papers.

MORGAN’s Autozyg and Lodscore programs use MCMC to estimate ibd probabilities and multilocus lod scores, respectively, in pedigrees. The latent (unobserved) parameters of interest in MCMC estimation of ibd probabilities and lod scores are the meiosis indicators at marker and/or trait loci for each non-founder in the pedigree. Observed data are trait values and unphased marker genotypes for some or all pedigree members. With unphased genotypes, it may or may not be possible to determine the grandparental source (i.e. the meiosis indicator) of each allele unambiguously. MORGAN uses MCMC to sample meiosis indicators (S) conditional on observed data (Y).

MORGAN implements two different block Gibbs samplers, a locus- and a meiosis-sampler, for sampling from S conditional on Y. Each method updates a subset, S_u, of S conditional on Y and on the currently fixed values of the rest of S (S_f). The difference between the two methods is the choice of S_u.

The locus-sampler (or L-sampler) chooses S_u to be S.j for some j. In other words, a single locus is selected and meiosis indicators at that locus are updated based on the genotype data at all loci and on the current realization of meiosis indicators at all loci other than j. The MORGAN user can determine whether a locus is to be selected at random each time or if loci are taken in a pre-determined random order, as described in the next section. The update computations use a modification of the Elston-Stewart algorithm [ES71] and can be used whenever single locus pedigree peeling is possible. If inter-locus recombination fractions are strictly positive, the L-sampler is irreducible. On the downside, mixing is poor if loci are tightly linked.

The single-meiosis sampler (or M-sampler) chooses S_u to be Si. for some i. It is, in a sense, perpendicular to the L-sampler in that at each iteration a single meiosis is selected and meiosis indicators for that meiosis are updated conditional on the genotype data at all loci and the current realization of meiosis indicators for all other meioses. The M-sampler is a modification of the Lander-Green algorithm [LG87] for peeling along a chromosome using the Baum algorithm [Bau72]. At each iteration, a single meiosis is randomly selected or meioses can be updated sequentially. As with locus selection in the L-sampler, MORGAN allows the user to choose the meiosis selection The M-sampler mixes well in the presence of tightly linked loci, but it can perform poorly in large pedigrees with missing data.

The multiple meiosis sampler updates several meioses jointly and is therefore a generalization of the old single-meiosis sampler. There are four types of multiple-meiosis updates: random meiosis update, individual update, sib update and 3-generation update. This is based on work by Liping Tong in [TT08]. The new LM-sampler is a combination of L-sampler and multiple-meiosis M-sampler. This new LM-sampler is implemented in the program lm_linkage, which combines the earlier programs lm_markers and lm_multiple. The new LM-sampler can also be used in the programs lm_auto and gl_auto, and in the program lm_twoqtl. All these MORGAN 3.0 programs sample inheritance patterns conditional on marker data for use in subsequent lod score or ibd computations. They all have the option to use either the old (single-meiosis) or new (multiple-meiosis) LM-sampler.

MORGAN’s Autozyg and Lodscore programs use a combination of the L- and M-samplers, referred to as the LM-sampler. The user may choose the fraction of updates that are of each type; the default is 20% L-sampler, 80% M-sampler. The recommendation is 20/80, 50/50 or 80/20, depending on which sampler is, in any particular example, the more computationally intensive.

For original descriptions of the L-sampler see [He97], and for the M-sampler and LM-sampler see [TH99]. For additional mathematical details on the L-, M- and single-meiosis LM-samplers, see [Tho00]. For the new multiple-meiosis sampler see [TT08].

Up to MORGAN V2.8.2, MCMC was performed globally over pedigree components (except those small enough for exact computation). The L-sampler peeling and lod score estimation could be done either by component (using “set peeling by component”) or globally (the default).

With MORGAN V3.0, the preferred option is to do both MCMC and pedigree peeling (lod score estimation) by component, and to use exact computation on all sufficiently small component pedigrees. The alternative, retained so that older data sets can be rerun, is to use ‘set global MCMC’, in which case no exact computation will be done, and MCMC will be done globally over all component pedigrees.

The lm_haplotype program is a generalization of lm_multiple in which haplotypes of key individuals dividing the pedigree are sampled in addition to meiosis indicators. To facilitate efficient implementation of this algorithm, new peeling-by-component routines need to be implemented and checked. This program is also the work of Liping Tong. This program is not yet released.

See Concept Index for: LM-sampler, pedigree peeling, multiple meiosis sampler, block Gibbs sampler, L-sampler, M-sampler, L-sampler probability.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

8.5 MCMC computational options

MORGAN can obtain a starting configuration for S in one of two ways. The default method is by sequential imputation. The alternative is to construct an L-sampler realization independently for each locus, conditional on the genotype data at that locus only (the locus-by-locus option). Sequential imputation tends to produce initial configurations that have higher conditional probabilities, but locus-by-locus sampling can sometimes reveal other modes in the complex space of S values. The MORGAN user can select the independent-loci setup method by including the ‘use locus-by-locus for setup’ statement. If sequential imputation is selected, the user can specify the number of sequential imputation samples from which the starting configuration of meiosis indicators is to be selected, using the ‘use I sequential imputation realizations for setup’ statement. The default is 10% of the total MC iterations.

At each MCMC iteration, MORGAN selects a locus (with L-sampler) or set of meioses (with M-sampler) to update. Two different selection methods are available: sample by step and sample by scan. If ‘sample by scan’ is chosen, all loci or meioses are updated one-at-a-time in a predetermined random order. This option is the default. If ‘sample by step’ is chosen, a single locus or meiosis is randomly selected for updating at each iteration. The sampling method selected applies to the entire MCMC run, including burn-in, pseudo-prior computation and main iterations.

When running a MORGAN MCMC program, the user must specify the desired number of several types of iterations. For all programs, some number of initial burn-in iterations must be performed. These realizations are discarded and, if the burn-in period is sufficiently long, subsequent points will be dependent samples from approximately the desired stationary distribution. The ‘set burn-in iterations’ statement is used to specify the number of desired burn-in iterations, with the default value varying by program. The desired number of ‘main’ iterations must be specified using the ‘set MC iterations’ statement; there is no default number of main iterations. For real-data analyses the recommended number of iterations is on the order of 10^5.

The lm_bayes program samples not only meiosis indicators, but also the location of the trait locus. This is done via a third type of MCMC Metropolis-Hastings update. The counts of sampled trait-locus locations is used to calculate pseudo-priors, which are then used in lod score estimation. Alternatively, pseudo-priors can be read from an input file. The goal of this two-stage procedure is to weight locations in order to encourage the MC sampler to visit test positions of low conditional probability. The number of iterations for calculation of pseudo-priors is set using the ‘set pseudo-prior iterations’ statement, or the default value of 50% of the number of main iterations can be used.

Specific Autozyg and Lodscore programs have additional parameters and options that are described in the relevant sections of the next three chapters of the tutorial.

In addition to the main program-specific outputs described in the following chapters, the MCMC process accumulates diagnostic counts, scoring the configuration of meiosis indicators at intervals determined by the same statement compute scores every I iterations as is used for scoring for the primary output. (By default, scores and diagnostic output are computed every iteration.)

There are three components to the MCMC diagnostic output:

Average total log-probability of segregations:
This is the average (over the scored iterations) of the total (over meioses) of the log-probability of the meiosis indicators. For the first locus this is simply the marginal probability log((1/2)^m) for m meioses, and for each successive locus is log P(S.j | S.(j-1)) for locus j conditional on locus j-1.
Average total log-probability of penetrances, by locus
This is the average (over the scored iterations) of the combined (over observed individuals) log-probability of the observed data at the locus given the inheritance configuration S.j.
Recombination counts for map intervals
This is the total count over (male and female) meioses and over MCMC iterations of realizations of configurations of meiosis indicators that are recombinant and non-recombinant in each interval of the map.

In these diagnostic scores, for the programs lm_pval and lm_linkage only marker loci and marker map intervals are included in these diagnostic scores. For lm_auto, the trait locus (designated ‘0’) is included in the correct position, if it is included in the MCMC, while the gl_auto program requires a null (no-data) unlinked trait locus. This null locus may or may not be included in the MCMC sampling: see ‘set MCMC markers only’ in Autozyg MCMC parameters and options.

See Concept Index for: MCMC options, sequential imputation, locus-by-locus setup, sample by step, sample by scan, burn-in, MC iterations, pseudo-prior iterations.

[ < ]

[ > ]

[ << ]

[ Up ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

8.6 MCMC parameter statements

These statements which set the parameters for the MCMC algorithms, apply to both Autozyg programs and to the Lodscore programs, unless otherwise noted.

use (locus-by-locus sampling | sequential imputation) for setup: There are two setup methods available to find a starting configuration for the meiosis indicators prior to the MCMC: using sequential imputation (with the trait treated as unlinked), or using locus-by-locus sampling (by assuming all markers and trait are unlinked). Sequential imputation is the default method.
use I sequential imputation realizations for setup: When sequential imputation is selected above, this statement specifies the number of sequential imputation samples from which the starting configuration of meiosis indicators is to be selected. The default is 20 iterations.
set MC iterations I: Required. It specifies the total number of main L- and M-sampling iterations. There is no default number of MCMC iterations; the total number of ‘main’ L- and M- sampling iterations must be specified for all Autozyg programs. The total MCMC run length is the sum of the number of burn-in iterations specified by the ‘set burn-in iterations’ statement and the number of main iterations specified in ‘set MC iterations’.
set burn-in iterations I: Burn-in iterations are performed initially, with the trait locus (if any) unlinked to the marker map. The default number of burn-in iterations is specific to each program.
sample by (scan | step): By default (sample by scan), all loci (L-sampler) or all meioses (M-sampler) are updated successively in an order determined by random permutation. When sampling by step, a single locus (L-sampler) or single meiosis (M-sampler) is randomly selected for updating. lm_bayes presently samples by scan only.
set L-sampler probability X: The L-sampler probability, between 0.0 and 1.0, specifies the probability in each MCMC iteration, of locus-sampling rather than meiosis-sampling. The default is 0.0, that is, to use M-sampler only.
compute scores every I iterations: The default is to score recombinations, total log-probabilities or the Rao-Blackwellized estimator every MCMC iteration. This statement specifies the frequency with which to compute the contributions to the ibd scores or the location lod scores.
check progress I MC iterations: Use this statement to monitor the progress of the program as it is running. It will print out the iteration number every Ith iteration.
set global MCMC: By default, MCMC is performed component-by-component, and exact computation and/or iid sampling is used on small pedigree components. If this statement is specified, MCMC will be done globally over the data set, and no exact computation will be done. Note that the formerly used set peeling by component is eliminated; only global lod scores or analysis will be performed if the global option is chosen. The recommendation is not to use this option, but it is retained for compatibility with older examples and data sets.
set limit for exact computation I1: This is the limit on the number of meioses in order for exact computation and iid sampling to be used instead of MCMC. The default value is 8; while exact computation for more than 8 meioses is certainly feasible it is often not computationally efficient.

See Concept Index for: MCMC parameter statements, sequential imputation, locus-by-locus sampling, meiosis indicators, MC iterations, burn-in, L-sampler, M-sampler, sample by scan, sample by step.

[ << ]

[ >> ]

[Top]

[Contents]

[Index]

[ ? ]

This document was generated by Elizabeth Thompson on September 6, 2019 using texi2html 1.82.