See Concept Index for: population-based IBD inference.
See References, for details of the cited papers.
The program ibd_create is a suite of seven subprograms that
together provide a set of tools for the creation of haplotypes
and ibd specifications, including ibd graphs
(See The ibd_class utility). These programs produce realistic
simulated data for use in testing analysis programs such as ibd_haplo.
Like the ibd_class utility (See The ibd_class utility),
ibd_create calls each of its seven subprograms through a command line
option. This option also determines which MORGAN parameter
statements will be recognized and how they will be interpreted.
All the examples for ibd_create and ibd_haplo are based
on the current gold standards. The marker data for these gold standards have
been updated to the publicly available
European samples of the 1000 Genomes project. These provide 758 phased and
imputed haplotypes based on the GBR, FIN, IBS, CEU and TSI subpopulation
samples.
The phased haplotypes were downloaded from the Browning website,
http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes.phase1_release_v3/
while allele frequency and map position information
were obtained from the original 1000 genomes vcf files:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521/
(See http://mathgen.stats.ox.ac.uk/impute/README_1000G_phase1integrated_v3.txt
for additional information.)
A BEAGLE DAG model was constructed for these haplotypes by running
BEAGLE with a scale parameter 1.0. All files with the partial name
"eur_recode_s1" derive from this data set and DAG model.
For greater clarity we here divide the programs into three subgroups. The first two subprograms are beaglesim and beagledag.
beaglesim realizes haplotypes
on this set of real SNPs, at known real SNP locations, with the sample SNP
allele frequencies, and with the LD structure as fit by BEAGLE to the original
sample.
beaglesim also provides an “LD relaxation parameter”, which
allows haplotypes to be generated from the same DAG but at varying LD levels.
beaglesim can be run without additional LD tuning
(i.e. with the LD relaxation probability parameter set equal to 0).
beaglesim can take and produce data either as character (A,C,G,T) or
numeric (1=reference allele, 2=alternate allele) allele labels. For downstream
use in other MORGAN programs, it is more
convenient to convert to numeric allele labels.
beaglesim generates realizations by haplotype,
but can output these haplotypes as rows, as columns, or in
MORGAN marker data genotypic format
(See ‘marker data’ in Concept Index).
Care must be taken to
ensure the correct row/column specification for downstream use in
other MORGAN programs, or other software.
For additional information about beaglesim see [BGZT12].
beagledag computes local haplotype frequencies from the DAG, intended
for use with ibd inferred by gl_auto or ibd_haplo. This would provide an
adjustment for LD when this inferred ibd is used in gl_lods to produce
Monte Carlo lod score estimates (See Parameter files for the gl_lods program).
(See The ibd_class utility for more information on ibd graphs).
The next subprogram of ibd_create
has been replaced in MORGAN V3.4 by a new
version of ibddrop; See Introduction to ibddrop.
The new ibddrop not only
simulates descent at a finite number of linked markers, but can
now also
simulate the recombination breakpoints across the chromosome.
The final three subprograms of ibd_create
are fgl2ibd, fgl2haplo and fgl2dgl.
Where the marker map is given in base pairs, a parameter statement is provided to convert from the bp scale to the cM scale of other MORGAN programs.
In fgl2ibd, set proband gametes
statements are used to define the gametes among which ibd
is scored.
The output is ibd states for the specified gametes, at the
specified marker locations: this output is in a variety of formats
specified more fully below.
fgl2haplo takes as input founder haplotypes, such as those generated by
beaglesim from a BEAGLE DAG that was fit
to real haplotypes. Each FGL is assigned a unique haplotype. To permit
multiple realizations on a single IBD structure, the set of haplotypes is
randomly permuted before the assignment is made.
fgl2dgl converts the ibd graphs produced by simpop_fgl.
Its output is an ibd graph in the format produced
by the gl_auto program, which can therefore be input into
gl_lods; See Introduction to lm_auto gl_auto and lm_pval.
The program ibd_haplo
computes conditional probabilities of gene IBD (identity
by descent) states, given data for marker loci, for specified
sets of proband gametes (i.e. scoresets).
The proband gametes in each scoreset are
specified gametes (maternal or paternal) of specified individuals. These
are input as for the lm_auto program:
See Sample lm_auto parameter file.
The program has been generalized to allow for sets of up to ten gametes, although computational limitations suggest that considering more than seven gametes jointly is impractical. Internally, IBD states are the gametic states (that is, 15 states for 4 gametes), and states are ordered lexicographically, although the “traditional” Jacquard ordering may be requested for sets of four gametes.
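The growth in the number of gametic IBD states can be made concrete with a short sketch. It assumes, as the "15 states for 4 gametes" figure suggests, that each state is a partition of the gametes into IBD classes, so that the count is a Bell number. The function names are illustrative and are not part of MORGAN, and the lexicographic order of the labels shown here is assumed rather than confirmed to match the program's internal order.

```python
def bell_number(n: int) -> int:
    """Number of ways to partition n gametes into IBD classes."""
    row = [1]  # first row of the Bell triangle
    for _ in range(n - 1):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]


def ibd_state_labels(n: int) -> list[str]:
    """Labels such as '1123' for n <= 9 gametes, in lexicographic order.

    Gametes carrying the same digit are IBD; a new digit may only be one
    larger than the largest digit used so far, so each partition appears once.
    """
    labels: list[str] = []

    def extend(prefix: list[int], largest: int) -> None:
        if len(prefix) == n:
            labels.append("".join(str(x) for x in prefix))
            return
        for label in range(1, largest + 2):
            extend(prefix + [label], max(largest, label))

    extend([], 0)
    return labels
```

Under this assumption there are 15 states for 4 gametes, 877 for 7 gametes and 115975 for 10 gametes, which is consistent with the remark that more than seven gametes is impractical.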
The marker data are read in using a standard MORGAN marker data file; See Sample lm_auto parameter file. Each individual named in the marker data must have a unique name. There may be missing data, but each single-locus genotype of an individual must be either present or absent (" 0 0"); presence of a single allele cannot be specified. The marker data are read in as genotypes, but they may be analyzed as an ordered pair of alleles (i.e. phased), and must be so if only a single gamete of any individual is specified in the scoreset.
The program uses an HMM for the latent IBD states. There are two options for the transition matrix of the HMM latent IBD state; the one applicable to any number of gametes is the ‘2011’ matrix developed by Chaozhi Zheng [BGZT12]. Given the latent state, the locus-specific genotype probabilities are based on the premise that IBD DNA should be of the same allelic type, and that non-IBD DNA is of independent allelic types, although allowance is made for typing error to eliminate zero emission probabilities. The transition matrices are also modified to eliminate zero transition probabilities. A forward-backward HMM computation provides the probabilities for each IBD state at each locus for each set of proband gametes.
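The forward-backward computation can be illustrated for the simplest case of two gametes with two states (non-IBD and IBD). This is only a sketch under stated assumptions: the transition matrix below is a toy stand-in (keep the state with probability exp(-rate × distance), otherwise redraw from the prior) and is not the ‘2011’ or ‘2009’ matrix, and the way the error rate enters the emission probabilities is likewise assumed. All function names are hypothetical.

```python
import math


def transition(prior, rate, distance_cm):
    """Toy transition matrix (an assumption, not the MORGAN '2011' matrix)."""
    keep = math.exp(-rate * distance_cm)
    n = len(prior)
    return [[keep * (i == j) + (1.0 - keep) * prior[j] for j in range(n)]
            for i in range(n)]


def emission(a, b, freq, error):
    """P(alleles a, b | state) for states [non-IBD, IBD].

    Non-IBD gametes carry independent alleles; IBD gametes carry the same
    allele, except that with probability `error` they are treated as
    independent, which removes zero emission probabilities.
    """
    independent = freq[a] * freq[b]
    identical = freq[a] if a == b else 0.0
    return [independent, (1.0 - error) * identical + error * independent]


def forward_backward(prior, transitions, emissions):
    """Posterior state probabilities at each locus.

    transitions[k] is the matrix between locus k and locus k + 1.
    Forward and backward vectors are rescaled at each step for stability.
    """
    n_loci, n_states = len(emissions), len(prior)
    forward = []
    for k in range(n_loci):
        if k == 0:
            raw = [prior[s] * emissions[0][s] for s in range(n_states)]
        else:
            t = transitions[k - 1]
            raw = [sum(forward[-1][i] * t[i][j] for i in range(n_states))
                   * emissions[k][j] for j in range(n_states)]
        total = sum(raw)
        forward.append([x / total for x in raw])
    backward = [[1.0] * n_states for _ in range(n_loci)]
    for k in range(n_loci - 2, -1, -1):
        t = transitions[k]
        raw = [sum(t[i][j] * emissions[k + 1][j] * backward[k + 1][j]
                   for j in range(n_states)) for i in range(n_states)]
        total = sum(raw)
        backward[k] = [x / total for x in raw]
    posteriors = []
    for f, b in zip(forward, backward):
        raw = [x * y for x, y in zip(f, b)]
        total = sum(raw)
        posteriors.append([x / total for x in raw])
    return posteriors
```

For example, with prior `[0.95, 0.05]` and three consecutive loci at which both gametes carry a rare allele, the posterior probability of IBD at each locus rises above the prior value of 0.05.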
For the case when the scoreset consists of both maternal and paternal gametes of a set of (normally two) individuals, additional options are available. For two individuals there are 15 IBD states, although only 9 are distinguishable from unphased genotypes. In this case there are two transition matrices available; the earlier ‘2009’ matrix of [Tho08b] is also an option. The input genotypes may be analyzed as an ordered or unordered pair of alleles (i.e. phased or unphased). There is also an option for partial phasing, in which segments of chromosome (sets of contiguous markers) are specified as phased. If the data are analyzed as unphased or partially phased, the output state probabilities are of the genotypically distinguishable state classes (i.e. 9 states instead of 15). Finally, although internally the program still works with a lexicographic ordering of states, for four gametes output may be in terms of the more conventional ordering of the Jacquard states.
The methods and study results of this approach are provided in
[Tho08b] and [BGZT12].
Note that the data files and software released for
[BGZT12] are for an earlier version of ibd_haplo. The version
described here includes improvements both in the user interface and in the
way the IBD transitions are implemented.
This version was first released
for MORGAN V3.1.1, with further minor improvements for MORGAN 3.2.
Specifically,
for MORGAN 3.2 there have been modifications in the computation
of locus-to-locus transition probabilities, providing a better
approximation to the underlying continuous process modeled in
[BGZT12].
See Concept Index for:
ibd_create introduction,
ibd_haplo introduction,
marker data,
proband gametes,
ibd graph.
Sample parameter file for beaglesim:

    # Include everything in the output file.
    set printlevel 5
    # Provide a file name for the beaglesim input DAG file.
    input extra file "./eur_recode_s1.bgl.dag.gz"
    # Provide a file name for the beaglesim haplotypes file.
    output overwrite scores file "./eur_recode_s1.haplotypes"
    # The sampler seeds are going to be 53 and 5353 (ie '0x35' '0x14e9').
    set sampler seeds 53 5353      # Set the sampler seeds.
    # The following Morgan parameter statements are needed by beaglesim.
    output 758 haplotypes as rows
    set LD relaxation probability 0.05
These statements are mostly self-explanatory: See beaglesim and beagledag parameter statements for more details.
Sample parameter file for beagledag:

    # Include everything in the output file.
    set printlevel 5
    # Provide a file name for the beagledag input DAG file.
    input extra file "./eur_recode_s1.bgl.dag.gz"
    # Provide a file name for the beagledag reduced DAG file.
    output overwrite scores file "./eur_recode_s1.reduced"
Note that beagledag is still a prototype program. This
particular version is hard-wired to compute local haplotype frequencies
for a specific subset of consecutive
markers from this DAG. The subroutine structures are more general and
a more flexible example of the calling program
will be released in the future.
See Concept Index for:
sample parameter file for beaglesim and beagledag,
using the BEAGLE DAG.
The ibd_create
subprograms are called by invoking the program with
a flag specific to the subprogram. Example files are included in the
‘Haplo’ subdirectory of MORGAN_Examples. The relevant
run options are as follows.
Running ibd_create with the -s option runs beaglesim.
The example runs on the BEAGLE DAG [Browning06] that was fit to 758
European haplotypes of the 1000 Genomes Data. Haplotypes are generated
from the DAG model. It is recommended that both the output haplotypes
and the unzipped version of the DAG file be removed after the program is
run: these are large files.
Running ibd_create with the -d option runs beagledag.
This program computes local haplotype frequencies for a specific set
of markers in the DAG file. It is still very much a prototype program; it
may be generalized in the future.
It is recommended that both the output reduced DAG file
and the unzipped version of the DAG file be removed after the program is
run: these are large files.
See Concept Index for:
running beaglesim and beagledag.
The formats for simpop_fgl
have been modified for
MORGAN version 3.3.2 and later.
This subprogram now works in centiMorgans
rather than base pairs.
The Gold standard parameter file is identical to
the one in the ‘MORGAN_Examples/Haplo’ subdirectory.
Sample parameter file for simpop_fgl:

    set printlevel 3
    # Provide a file name for the simpop_fgl results file.
    output overwrite scores file "./simpop_fgl.ibdgraphs"
    set sampler seeds 11 7654      # Set the sampler seeds.
    # The following Morgan parameter statements are being used to pass the integers
    # and double precision reals needed as arguments for the simpop_fgl subprogram.
    set 19 females per generation
    set 30 offspring generations
    output final 2 generations
    set chromosome length 120.0 centiMorgans
    set 0.9 remating probability weight
These statements are mostly self-explanatory:
See simpop_fgl parameter statements
for more details, including the control of individuals as parents through
the remating probability.
See Concept Index for:
sample parameter file for simpop_fgl,
simulation of a population,
simulation of descent in a pedigree.
The examples here are the ones in the ‘IBD_Haplo/Gold’ directory.
They may be run as the gold standards simpop_fgl_gold,
which is a part of gold.4 in this
directory. Alternatively they may be run directly in the Gold
directory as follows:
Running ibd_create with the -p option runs simpop_fgl.
In each case,
a file containing the ibd graph of the requested individuals is produced.
Note that these ibd graphs have a slightly different format from those
used in the gl_auto and gl_lods programs. They are indexed
in micro-centiMorgans,
and contain additional information about the simulated population pedigree.
The fgl2dgl subprogram provides a translation between these two
forms of ibd graph.
See Concept Index for:
running simpop_fgl,
ibd graph.
There are two basic alternative requests for the fgl2ibd
subprogram of ibd_create. One requests scoring of ibd
among all pairs of individuals; the other, used in this example, requests
ibd scoring for specified sets of proband gametes. This example
requests sets of size 2, 3, 4, 5, 6, 7 and 10; the scoreset names indicate this,
but the names are arbitrary.
In this example, the marker data file provides marker locations in centiMorgans: thus any scaling factor to convert from base pairs to a genetic map is unnecessary and (if included) has no effect, although the output will report the scaling factor.
Here is the fgl2ibd_varying.par parameter file for this example:

    # Provide the simulated fgl (simpop_fgl's output) file name.
    input gamete data file "./simpop_fgl.ibdgraphs"
    # Each FGL in the data set is assigned a unique haplotype.
    # The "sampler" seeds are used to randomly permute the haplotypes before
    # assignment, to permit multiple realizations on a single IBD data set.
    set sampler seeds 0x00003039 0x00000431   # Replace by a seed file for real runs.
    # Provide a marker data file name of a file containing the marker map.
    input marker data file "./eur_recode_s1_map.markers"
    # Provide a file name for the fgl2ibd results file.
    output overwrite scores file "./fgl2ibd_varying.statelabels"
    # Provide a file name for the fgl2ibd score set identifiers file.
    output overwrite extra file "./fgl2ibd_varying.ids"
    # The following Morgan parameter statements are being used to specify which
    # proband gametes to include in each analysis run by the fgl2ibd subprogram.
    set scoreset 21 proband gametes 1148 0 1161 0
    set scoreset 31 proband gametes 1148 0 1158 1 1162 0
    set scoreset 41 proband gametes 1158 0 1158 1 1168 0 1168 1
    set scoreset 51 proband gametes 1148 0 1151 0 1158 0 1161 0 1168 0
    set scoreset 61 proband gametes 1158 0 1158 1 1161 0 1161 1 1168 0 1168 1
    set scoreset 71 proband gametes 1148 1 1151 1 1158 1 1161 1 1168 1 1171 1 1178 1
    set scoreset 101 proband gametes 1148 0 1148 1 1151 0 1151 1 1158 0 1158 1 1161 0 1161 1 1168 0 1168 1
    # The following Morgan parameter statement specifies the marker locations
    # at which ibd will be scored.
    select markers 10 20 30 40 50 60 70 80 90 100
                   110 120 130 140 150 160 170 180 190 200
                   210 220 230 240 250 260 270 280 290 300
                   310 320 330 340 350 360 370 380 390 400
                   410 420 430 440 450 460 470 480 490 500
                   .........
                   4410 4420 4430 4440 4450 4460 4470 4480 4490 4500
                   4510 4520 4530 4540
For the fgl2haplo
program there are several options regarding the
format of the input haplotypes. Here we give one example where the
input haplotypes are specified as columns. The marker map is in
centiMorgans so any scaling factor declared is irrelevant. However the input
value will be reported in output (or 1.0 will be reported, if the statement
is not included).
    # Provide the ibd graph input (simpop_fgl's output) file name.
    input gamete data file "./simpop_fgl.ibdgraphs"
    # Provide the output haplotype labels and output with spaces options.
    output haplotype labels
    output with spaces
    # The following Morgan parameter statement specified the markers
    # over which haplotypes will be generated. (Typically one will generate
    # complete haplotypes.)
    select all markers
    # The following Morgan parameter statement is irrelevant unless the input
    # marker map is specified in base pairs. For a centiMorgan map there is
    # no scaling. However the output will report the input value, or
    # 1.0 if the statement is not included.
    set 1.2 million base pairs per centiMorgan
    # Provide a marker data file name of a file containing the marker map.
    input marker data file "./eur_recode_s1_map.markers"
    # Provide the founder haplotypes file name.
    input extra file "./eur_recode_s1_hapcols"
    # Tell fgl2haplo how to interpret the haplotype data read in from the founder
    # haplotypes file.
    input 758 haplotypes as columns of 4547 snps
    # Provide a file name for the fgl2haplo results file.
    output overwrite scores file "./fgl2haplo_haps_as_cols.results"
The fgl2dgl
subprogram converts the ibdgraphs file from simpop_fgl to
the form used by the gl_lods program;
See Parameter files for the gl_lods program. The output
file scores switchpoints at marker locations. In this example
the marker locations are specified in base pairs, and a non-standard
scaling is used in the conversion to illustrate the use of this
statement. This is the ‘fgl2dgl_bp_0.75.par’ file:
    # Provide the simulated fgls (simpop_fgl's output) file name.
    input gamete data file "./simpop_fgl.ibdgraphs"
    # Provide a marker data file name of a file containing the marker map.
    input marker data file "./fgl2dgl_bp_map.markers"
    # Provide a file name for the fgl2dgl results file.
    output overwrite scores file "./fgl2dgl_bp_0.75.ibdgraphs"
    # The following Morgan parameter statement is required in order to satisfy
    # the proc_auto subroutine.
    select markers 120 130 290 450 525 770 979 980 1091 1290 1492 1597 1726 1900 2116 2400 2810 3055 3250 3400
    # The following Morgan parameter statement is now optional. If the user needs
    # to specify something other than 1.0 million base pairs per centiMorgan, it
    # can be used to pass a double precision real number as a scale factor to be
    # used by the fgl2dgl subprogram.
    set 0.75 million base pairs per centiMorgan
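The arithmetic implied by the scaling statement can be sketched as follows. This assumes the conversion is a simple linear rescaling, which is how the statement reads; the function names are illustrative and do not correspond to MORGAN routines.

```python
def bp_to_centimorgans(position_bp: float, mbp_per_cm: float = 1.0) -> float:
    """Convert a base-pair position to centiMorgans.

    mbp_per_cm is the value given in a statement of the form
    'set X million base pairs per centiMorgan' (1.0 if the statement is omitted).
    """
    return position_bp / (mbp_per_cm * 1_000_000)


def centimorgans_to_micro_cm(position_cm: float) -> int:
    """Express a centiMorgan position in micro-centiMorgans, the unit in
    which the simpop_fgl ibd graphs are said to be indexed."""
    return round(position_cm * 1_000_000)
```

With the 0.75 scaling of this example, a marker at 1,500,000 bp would be placed at 2.0 cM; with the default of 1.0 it would be placed at 1.5 cM.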
See Concept Index for:
sample parameter files for fgl2ibd, fgl2haplo and fgl2dgl.
These three programs have a variety of input formats and output options.
Here we include only one example of each program. For additional information
see the README_userdoc file in MORGAN/IBD_Haplo
or
fgl2ibd and fgl2haplo and fgl2dgl parameter statements. All three
programs use as input
a file of ibd graphs in the format produced by simpop_fgl.
Running ibd_create with the -i flag runs fgl2ibd.
The program fgl2ibd simply scores the ibd in that file at
specified markers for specified individuals. Note that you should retain
(or re-create) the simpop_fgl output file, ‘simpop_fgl.ibdgraphs’,
which is used as an input in fgl2ibd. The program produces
two output files. The first specifies only the scored proband
gametes, while the second gives the state specification at each marker
for each proband gamete set. This separation simplifies downstream
analysis, particularly if using R. Although for this example the output files
are not large, it is good practice to remove the output
files after running the program.
Running ibd_create with the -f flag runs fgl2haplo.
The program fgl2haplo
uses the supplied ibd graphs (for example those
produced by simpop_fgl) and supplied haplotypes to generate genetic
marker data for specified individuals, in accordance with their ibd
across the chromosome. Note that "haps_as_cols" refers to the input
format of haplotypes, not the output. Output is either as haplotypes
or genotypes in rows. Remember to remove any large output
files after running the program.
Running ibd_create with the -g flag runs fgl2dgl.
This program takes the supplied ibd graphs in the format produced by
simpop_fgl, and produces an ibd graph (or dgl graph) in the
format produced by gl_auto and used in programs such as
gl_lods. The input has switch-points in micro-centiMorgans.
The output form provides switch-points at selected markers,
so that, if the marker map is in base-pairs rather than centiMorgans, the
program makes the appropriate conversion.
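The flags described in this chapter can be collected in a small helper that builds a run command. The flag letters are those given in the text; the argument order (flag, then parameter file, with standard output redirected to a file) is an assumption modelled on the ibd_haplo run commands, since the ibd_create run commands themselves are not shown here. The helper name and its parameters are hypothetical.

```python
import shlex

# Flag for each ibd_create subprogram, as described in the text.
SUBPROGRAM_FLAGS = {
    "beaglesim": "-s",
    "beagledag": "-d",
    "simpop_fgl": "-p",
    "fgl2ibd": "-i",
    "fgl2haplo": "-f",
    "fgl2dgl": "-g",
}


def ibd_create_command(subprogram, parfile, outfile=None,
                       executable="./ibd_create"):
    """Build a shell command line for one ibd_create subprogram.

    Assumed form: <executable> <flag> <parameter file> [> <output file>].
    """
    flag = SUBPROGRAM_FLAGS[subprogram]
    command = " ".join(shlex.quote(part) for part in (executable, flag, parfile))
    if outfile is not None:
        command += " > " + shlex.quote(outfile)
    return command
```

For instance, `ibd_create_command("fgl2ibd", "fgl2ibd_varying.par", "fgl2ibd_varying.out")` yields `./ibd_create -i fgl2ibd_varying.par > fgl2ibd_varying.out`; the output file name in this call is purely illustrative.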
See Concept Index for:
running fgl2ibd, fgl2haplo and fgl2dgl,
phased and unphased genotypes.
Three sample parameter files for ibd_haplo
can be found in the directory ‘MORGAN_Examples/Haplo’.
All three examples are based on examples in the Gold standards for the
program. The examples are analogous to the three examples previously
used, except that the data have been changed to use (simulated) individuals
and haplotypes from the 1000 Genomes data.
The first two examples (‘phased_2011.par’ and ‘unphased_2011.par’) use the same data, and score the same sets of 4 gametes, consisting of the maternal and paternal gametes in five pairs of individuals. The examples differ only in whether the data are treated as phased haplotypes or unphased genotypes.
Here is the unphased_2011.par parameter file:

    set printlevel 3               # See comment below
    input marker data file "./sim76indivs.markers"
    output overwrite scores file "./unphased_2011.qibd"
    output overwrite extra file "./unphased_2011.ids"
    # The following five Morgan parameter statements specify the
    # computational set-up for the program
    select 2011 state transition matrix
    select unphased data
    set population kinship 0.05
    set kinship change rate 0.05
    set transition matrix null fraction 0.05
    set genotyping error rate 0.01
    output four-gamete state order jacquard
    # The following Morgan parameter statements are being used to specify which
    # proband gametes to include in each analysis run by ibd_haplo.
    set scoreset 1 proband gametes 1107 0 1107 1 1119 0 1119 1
    set scoreset 2 proband gametes 1111 0 1111 1 1115 0 1115 1
    set scoreset 3 proband gametes 1123 0 1123 1 1127 0 1127 1
    set scoreset 4 proband gametes 1131 0 1131 1 1135 0 1135 1
    set scoreset 5 proband gametes 1169 0 1169 1 1170 0 1170 1
    # The program computes ibd at, and uses only, a subset of the markers:
    # in fact every tenth marker in this example
    select markers 10 20 30 40 50 60 70 80 90 100
                   110 120 130 140 150 160 170 180 190 200
                   210 220 230 240 250 260 270 280 290 300
                   .... lines omitted here
                   4310 4320 4330 4340 4350 4360 4370 4380 4390 4400
                   4410 4420 4430 4440 4450 4460 4470 4480 4490 4500
                   4510 4520 4530 4540
Since ibd_haplo
typically runs with a very large number of markers,
it is advisable to suppress printing of the marker map and allele frequencies
using the ‘printlevel’ setting. The file specifications, and marker
data, are as for previous programs such as lm_auto:
See Sample lm_auto parameter file. Note that there are two output files.
The marker data file ‘sim76indivs.markers’
contains the positions and allele
frequencies of 4547 markers and marker genotypes of 76 individuals.
(These data were created using simpop_fgl and fgl2haplo,
starting from the BEAGLE DAG fit to the European chromosomes of the
1000 Genomes data.)
In this example a subset of the markers is selected; see the end of the file.
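A long marker selection of this kind is easier to generate than to type. The following sketch writes the statement for every tenth marker; the helper name is hypothetical, and it emits the statement on a single line, so any wrapping across lines as in the sample file would need to be added separately.

```python
def select_markers_statement(first: int, last: int, step: int) -> str:
    """Return a 'select markers' statement listing first, first+step, ..., up to last."""
    markers = range(first, last + 1, step)
    return "select markers " + " ".join(str(m) for m in markers)
```

Calling `select_markers_statement(10, 4540, 10)` lists 454 markers, from 10 to 4540.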
The second group of statements relates to the ibd_haplo implementation.
The ‘2011’ transition matrix is to be used; this is the one described
in [BGZT12] and is recommended. The earlier ‘2009’ option of
[Tho08b] is retained for backwards compatibility. The data are to be analyzed
as unphased genotypes, and there are four numerical parameters of the HMM
model. Most importantly these include the ‘population kinship’,
which is the mean a priori
level of pairwise IBD between any pair of gametes.
For compatibility with previous output formats, the parameter file requests ordering of the states in all output information in the “genetic” (or Jacquard) order, rather than lexicographic order. Note that this option is only available for sets of four gametes. Here the four gametes are the two of each of two individuals, and data are analyzed as unphased, so there will be nine states in the output IBD probabilities. (The ordering of the states is given at the end of Running ibd_haplo examples and sample output.)
The next set of statements specifies five sets of four gametes among which IBD is to be scored. Since the data are to be analyzed as unphased it is required that each set contains both the maternal and paternal gametes of individuals. In general, any gametes of individuals in the marker data file may be specified.
The second example parameter file ‘phased_2011.par’ differs only in the names of the output file and in the statement ‘select phased data’, which specifies that the data should be treated as phased haplotypes. In this case it is not necessary that a scoreset consists of both gametes of individuals.
The third example ‘ten_ss.par’ shows the flexibility of scoresets. This example differs in that only a smaller subset of markers is used:
    select markers 3010 3020 3030 3040 3050 3060 3070 3080 3090 3100
                   3110 3120 3130 3140 3150 3160 3170 3180 3190 3200
                   3210 3220 3230 3240 3250 3260 3270 3280 3290 3300
Additionally, the scoresets are quite varied:
    set scoreset 61 proband gametes 1174 0 1174 1 1176 0 1176 1 1163 0 1163 1
    set scoreset 41 proband gametes 1163 0 1163 1 1169 0 1169 1
    set scoreset 43 proband gametes 1165 0 1165 1 1167 0 1167 1
    set scoreset 51 proband gametes 1163 0 1163 1 1169 0 1169 1 1165 0
    set scoreset 32 proband gametes 1163 0 1163 1 1169 0
    set scoreset 21 proband gametes 1176 0 1176 1
    set scoreset 42 proband gametes 1174 0 1174 1 1176 0 1176 1
    set scoreset 52 proband gametes 1163 1 1165 1 1167 1 1169 1 1170 1
    set scoreset 44 proband gametes 1170 1 1170 0 1172 0 1172 1
    set scoreset 31 proband gametes 1167 0 1169 1 1169 0
The scoresets may have arbitrary numerical indicators, and range in size from 2 to 6 gametes. The program will reorder them according to size.
Note that the Jacquard ordering is requested for the four-gamete scoresets. There are four such scoresets (41, 42, 43 and 44); these will use Jacquard ordering, while for the remainder the ordering will be lexicographic. This again shows the flexibility of the program, but mixing orderings is likely to be confusing in real analyses.
Run the examples in the ‘Haplo’ subdirectory of the ‘MORGAN_Examples’ directory with one of the following commands:

    ./ibd_haplo unphased_2011.par > unphased_2011.out
or
    ./ibd_haplo phased_2011.par > phased_2011.out
or
    ./ibd_haplo ten_ss.par > ten_ss.out
Each example produces two output files in addition to the standard output. The standard output gives little information when ‘printlevel 3’ is used. The proband gamete sets are specified, and the program reports as it analyzes each set. In between, the program does give the prior probability of IBD states for each size of scoreset requested, and the transition matrix at a distance of 1 centiMorgan – these are mainly for checking purposes.
The two main output files are the ‘qibd’ file and the ‘ids’ file. For the first example, these have been named as ‘unphased_2011.qibd’ and ‘unphased_2011.ids’ with analogous names for the other examples. (Of course, any names for these files can be specified in the parameter file.) Each file contains only numeric data, so that it can be read easily into R or other programs for further analysis. The ‘qibd’ file is the key output of probabilities of IBD states in each scoreset at each marker, computed conditional on marker data. The ‘ids’ file gives the scoresets.
For this example the ‘ids’ output file is
    1 4 9 1107 0 1107 1 1119 0 1119 1
    2 4 9 1111 0 1111 1 1115 0 1115 1
    3 4 9 1123 0 1123 1 1127 0 1127 1
    4 4 9 1131 0 1131 1 1135 0 1135 1
    5 4 9 1169 0 1169 1 1170 0 1170 1
That is, there are five scoresets numbered 1 to 5. Each consists of 4 gametes. These gametes are specified in the usual format: ‘0’ for a maternal gamete and ‘1’ for a paternal, of the individual whose name ID is given. Since the data are analyzed as unphased, there are only 9 IBD states for each scoreset.
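Because the ‘ids’ file is purely numeric, it is straightforward to read outside R as well. The sketch below parses one line according to the layout just described (scoreset label, number of gametes, number of states, then individual/gamete pairs); the class and function names are illustrative only.

```python
from typing import List, NamedTuple, Tuple


class Scoreset(NamedTuple):
    label: int
    n_gametes: int
    n_states: int
    gametes: List[Tuple[int, int]]  # (individual ID, 0 = maternal / 1 = paternal)


def parse_ids_line(line: str) -> Scoreset:
    """Parse one line of an 'ids' file into a Scoreset."""
    fields = [int(x) for x in line.split()]
    label, n_gametes, n_states = fields[:3]
    rest = fields[3:]
    if len(rest) != 2 * n_gametes:
        raise ValueError("expected %d gamete fields, found %d"
                         % (2 * n_gametes, len(rest)))
    gametes = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return Scoreset(label, n_gametes, n_states, gametes)
```

Applied to the first line above, this returns scoreset 1 with 4 gametes, 9 states, and the gametes `(1107, 0)`, `(1107, 1)`, `(1119, 0)`, `(1119, 1)`.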
The number of lines in the output ‘qibd’ file is the number of scoresets times the number of markers used: 2270 (5 scoresets times 454 markers) for the first two examples here. For the ‘unphased’ case, each line consists of two integers followed by 10 real numbers. The first line starts

    1 10 0.618463 0.0000 0.0000 0.0028 0.0003 0.0028 0.0003 0.8416 ....

while line 2037 starts

    5 2210 58.999373 0.1462 0.0009 0.2902 0.0011 0.1479 0.0007 0.0095 ....
The first item indicates the scoreset, the second the marker number, and the third the centiMorgan (or Mbp) position of the marker. The remaining 9 numbers are the probabilities of the 9 IBD states. In the first line, most of the probability (0.8416) is in state 7, which is the state ‘1212+1221’. That is, the two individuals are likely to share both gametes IBD at this first locus, but in this state there is no IBD between the gametes within individuals. In the second line shown, at marker 2210 in scoreset 5, there is probability 0.1462 that all four gametes of the two individuals are IBD, and even higher probability (0.2902) that the two gametes of the first individual are IBD, and shared IBD with one of the two gametes of the second individual.
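The line layout and the line numbering quoted above can be checked with a short sketch. It assumes that lines are grouped by scoreset and ordered by marker within each scoreset, which is consistent with line 2037 being the 221st selected marker (2210) of the fifth scoreset when 454 markers are used. The function names are illustrative.

```python
def parse_qibd_line(line: str):
    """Split a 'qibd' line into (scoreset, marker, position, probabilities)."""
    fields = line.split()
    scoreset, marker = int(fields[0]), int(fields[1])
    position = float(fields[2])
    probabilities = [float(x) for x in fields[3:]]
    return scoreset, marker, position, probabilities


def most_probable_state(probabilities) -> int:
    """1-based index of the state with the largest probability."""
    return max(range(len(probabilities)), key=probabilities.__getitem__) + 1


def qibd_line_number(scoreset_rank: int, marker_rank: int, n_markers: int) -> int:
    """1-based line number of a given scoreset and selected marker.

    Both ranks are 1-based positions in the order of the output, assuming
    all markers of one scoreset are written before the next scoreset begins.
    """
    return (scoreset_rank - 1) * n_markers + marker_rank
```

Here `qibd_line_number(5, 221, 454)` gives 2037, matching the line quoted in the text.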
For sets of 4 gametes, we use the traditional ordering of the 15 IBD states or 9 reduced genotypic states:
The order of the 15 states is

    1111, 1122, 1112, 1121, 1123, 1211, 1222, 1233, 1212, 1221, 1213, 1231, 1223, 1232, 1234.

For the nine reduced states, the order is the same, but genotypically equivalent ones are combined:

    1111, 1122, 1112+1121, 1123, 1211+1222, 1233, 1212+1221, 1213+1231+1223+1232, 1234.
For more general gamete sets, the ordering is lexicographic.
For more on the specification of IBD states see Sample lm_auto parameter file.
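The reduction from 15 to 9 states for four gametes can be reproduced with a short sketch. It assumes that gametes 1 and 2 belong to the first individual and gametes 3 and 4 to the second, and that two states are genotypically equivalent when one is obtained from the other by swapping the two gametes within an individual. The function names are illustrative.

```python
JACQUARD_15 = ["1111", "1122", "1112", "1121", "1123", "1211", "1222", "1233",
               "1212", "1221", "1213", "1231", "1223", "1232", "1234"]


def relabel(state: str) -> str:
    """Renumber IBD classes in order of first appearance, e.g. '2113' -> '1223'."""
    mapping = {}
    out = []
    for label in state:
        if label not in mapping:
            mapping[label] = str(len(mapping) + 1)
        out.append(mapping[label])
    return "".join(out)


def swap(state: str, i: int, j: int) -> str:
    """Exchange the labels of gametes i and j (0-based)."""
    chars = list(state)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def genotype_class(state: str) -> set:
    """All states indistinguishable from `state` without phase information."""
    variants = {state, swap(state, 0, 1), swap(state, 2, 3),
                swap(swap(state, 0, 1), 2, 3)}
    return {relabel(v) for v in variants}


def reduced_states(states) -> list:
    """Combine genotypically equivalent states, keeping the input order."""
    classes, seen = [], set()
    for state in states:
        if state in seen:
            continue
        orbit = genotype_class(state)
        classes.append("+".join(s for s in states if s in orbit))
        seen |= orbit
    return classes
```

Applied to the 15-state order above, `reduced_states(JACQUARD_15)` returns the nine reduced states in the order listed.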
12.10.1 beaglesim and beagledag parameter statements
12.10.2 simpop_fgl parameter statements
12.10.3 fgl2ibd and fgl2haplo and fgl2dgl parameter statements
12.10.4 ibd_haplo parameter statements
See Concept Index for population-based IBD inference parameter statements
The following statements are specific to the
beaglesim and beagledag subprograms of ibd_create, or have a
particular role in these subprograms.
set LD relaxation probability X
beaglesim can simulate haplotypes from the same base DAG but
at varying (lesser) levels of LD. At each DAG node, with a probability
equal to the parameter, the program selects a random node at the
next level, breaking LD in the generation of this particular haplotype.
The default value 0.0 is generally recommended, except where these
varying LD levels are the target of analysis.
output haplotypes as (rows | columns)
beaglesim takes the DAG model produced by the BEAGLE software,
and simulates haplotypes from the model. These haplotypes are
generated one-by-one as “rows”, but may be output either as rows
or as columns for easier use in downstream programs.
output I genotypes
This is an alternate output option for beaglesim. Note that either
an output haplotypes ... or output ... genotypes statement
is required.
beaglesim can output its haplotypes in MORGAN marker
data file format, so that they can be more easily used in MORGAN
downstream analyses. The output produced by beaglesim in this case
consists of a set markers ... data ... statement with the gametes
ordered as defined for this parameter statement.
output with spaces
This statement can be used to insert spaces between the alleles in the
output file of haplotypes produced by beaglesim. The file is twice as large, but it may be
easier for input to downstream analysis programs. The default is absence
of spaces.
output haplotype labels
This statement implies the "output with spaces" option for haplotype output. If haplotype labels are requested, and if haplotypes are to be output as rows, then each row will include as the first item the ID number of the individual to whom the haplotype belongs. If labels are requested, and if haplotypes are output as columns, then the first line of the output will contain the ID numbers of the individuals to whom the corresponding haplotype column belongs.
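The effect of the LD relaxation probability described above can be illustrated on a toy DAG. This is a sketch only: the data structure below is hypothetical and is not the BEAGLE DAG file format, and it assumes that the random node at the next level is chosen uniformly, which the description above does not specify.

```python
import random

# Toy DAG: one dict per marker level, mapping node id -> list of edges.
# Each edge is (allele, destination node at the next level, probability).
TOY_DAG = [
    {0: [(1, 0, 0.5), (2, 1, 0.5)]},
    {0: [(1, 0, 1.0)], 1: [(2, 0, 1.0)]},  # perfect LD with the first marker
]


def simulate_haplotype(dag, relaxation, rng):
    """Walk the DAG from the root, emitting one allele per level.

    With probability `relaxation`, the destination node is replaced by a
    node drawn uniformly from the next level (assumed behaviour).
    """
    node = 0
    alleles = []
    for depth, level in enumerate(dag):
        edges = level[node]
        allele, next_node = rng.choices(
            [(a, n) for a, n, _ in edges],
            weights=[p for _, _, p in edges],
        )[0]
        alleles.append(allele)
        if depth + 1 < len(dag):
            if rng.random() < relaxation:
                next_node = rng.choice(sorted(dag[depth + 1]))
            node = next_node
    return tuple(alleles)
```

With relaxation 0.0 the toy DAG produces only the haplotypes (1, 1) and (2, 2); with a positive relaxation probability the recombinant haplotypes (1, 2) and (2, 1) also appear.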
The following statements are specific to the
simpop_fgl
subprogram of ibd_create
, or have a
particular role in this subprogram.
set chromosome length R centiMorgans
This statement is required by simpop_fgl
.
It provides the total length of the chromosome on which simpop_fgl simulates recombination breakpoints at each meiosis.
set I offspring generations
This statement is required by simpop_fgl
.
This sets the number of additional generations after the founder
generation that simpop_fgl
will generate.
set I females per generation
This statement is required by simpop_fgl
.
simpop_fgl
simulates generations of constant size, with equal
numbers of males and females in each generation. This parameter
specifies the number of females in each generation.
set X remating probability weight
This statement is optional for simpop_fgl
.
simpop_fgl
simulates each generation by successively
selecting a random male and a random female and generating a male and
a female offspring. If a sampled individual has been a member of
m previous matings, then the selected individual is accepted
with probability X^m. The value X = 1/3 generates reasonable
human pedigrees, and gives an effective population size about equal
to the census size.
(Random mating is achieved by the default parameter value 1.0, but is not
recommended.)
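The acceptance rule can be sketched as a simple rejection sampler. The function name `pick_parent`, the `mating_counts` dictionary, and the choice to redraw a single candidate on rejection are assumptions made for illustration; the text above specifies only the acceptance probability X^m.

```python
import random

def pick_parent(candidates, mating_counts, x, rng):
    """Draw a parent at random, accepting with probability x**m,
    where m is the number of previous matings of the candidate.

    Assumes at least one candidate has a nonzero acceptance probability.
    """
    while True:
        cand = rng.choice(candidates)
        m = mating_counts.get(cand, 0)
        if rng.random() < x ** m:
            return cand
```

With `x = 1.0` every draw is accepted (random mating). With `x = 1/3`, an individual who has already mated twice is accepted only with probability 1/9.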
output final I generation
This statement is required by simpop_fgl
.
simpop_fgl
outputs only the last generations that it simulates.
This parameter specifies the number of these final generations that will
be output.
The following statements are specific to the
fgl2ibd
and fgl2haplo
subprograms of ibd_create
, or have a
particular role in these subprograms.
map [chromosome I] marker positions base pairs X1 X2...
Although genetic-map (centiMorgan) locations are to be preferred where
available,
the fgl2haplo
and fgl2ibd
subprograms
can alternatively be provided with marker locations in base pairs.
set R million base pairs per centiMorgan
The fgl2haplo
and fgl2ibd
subprograms
can be provided with marker locations in base pairs. In this case, the scaling
factor R provided by this statement is used to convert these
locations to the centiMorgan scale used by simpop_fgl
.
If no scaling is provided, the program equates
1 million base pairs with 1 centiMorgan.
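The conversion amounts to a single division. A minimal sketch, in which the function name `bp_to_centimorgans` and its argument names are hypothetical:

```python
def bp_to_centimorgans(positions_bp, mbp_per_cm=1.0):
    """Convert base-pair positions to centiMorgans.

    mbp_per_cm plays the role of R in the statement above; the default
    of 1.0 equates 1 million base pairs with 1 centiMorgan.
    """
    return [bp / (mbp_per_cm * 1_000_000) for bp in positions_bp]
```

For example, with R = 2.0 a marker at 2,500,000 base pairs is placed at 1.25 centiMorgans.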
input gamete data file S
This statement is used by both fgl2ibd
and fgl2haplo
to specify the input file from which the program should read
the fgl-segment compact-format chromosomes that it uses. Currently
it is assumed that the chromosomes are in the ibd graph
format generated by simpop_fgl
. This is a slightly different
format from the ibd graphs produced from analysis of marker data on
a defined pedigree that are produced by the Autozyg
program
gl_auto
. (See The ibd_class utility).
input I haplotypes as (columns | rows) of I2 SNPs
fgl2haplo
requires a set of marker haplotypes which it will
apply to the fgl-based ibd graph. These may be input either
as rows or as columns, but the number of SNPs and haplotypes should
be specified.
input genotypes as rows
Alternatively, the input haplotypes for fgl2haplo
may be input
in standard MORGAN marker genotype format. In this case
each row of marker genotypes will be interpreted as a pair of
phased haplotypes.
(See Single and multiple meiosis LM-samplers and
``phased and unphased marker haplotypes''
in Concept Index)
output all genotypes
This is an alternative output option for fgl2haplo
.
fgl2haplo
can output its haplotypes in MORGAN marker
data file format, so that they can be more easily used in MORGAN
downstream analyses. In this case the individual names from
the simpop_fgl
file are output as the first item in each line
of genotypes.
output with spaces
This statement can be used to insert spaces between the alleles in the
output file of haplotypes produced by fgl2haplo
.
The file is twice as large, but it may be
easier for input to downstream analysis programs. The default is absence
of spaces.
output haplotype labels
This statement implies the "output with spaces" option for haplotype output. If haplotype labels are requested, and if haplotypes are to be output as rows, then each row will include as the first item the ID name of the individual to whom the haplotype belongs. If labels are requested, and if haplotypes are output as columns, then the first line of the output will contain the ID names of the individuals to whom the corresponding haplotype columns belong.
output four-gamete state order jacquard
In the case of four gametes, the user may select to output states in the traditional "Jacquard" order: 1111, 1122, 1112, 1121, 1123, 1211, 1222, 1233, 1212, 1221, 1213, 1231, 1223, 1232, 1234. If the gametes are of a pair of individuals, and the data are analyzed as unphased, the output state-probabilities will be reduced to nine, in the ordering: 1111, 1122, 1112+1121, 1123, 1211+1222, 1233, 1212+1221, 1213+1231+1223+1232, 1234.
See Concept Index for: state ordering: Jacquard; state ordering: lexicographic.
The following statements are specific to ibd_haplo
, or have a
particular role in this program. Note that a number of the statements
apply only when the proband gamete set consists of the four gametes
of a pair of individuals.
select ([partially] phased | unphased) data
The "select ... data" statement is used to inform "ibd_haplo" whether to handle the data as phased data, unphased data or partially phased data. If the data are phased there is no restriction on whether proband gametes are related to each other or not. If the data are unphased or partially phased it is necessary that the proband gametes are pairs of haplotypes, each pair belonging to a whole individual.
select [2009 | 2011] state transition matrix
There are two different state transition matrices implemented in the
ibd_haplo
program. The user must specify which transition matrix
to use for the analysis: see Sample parameter files for ibd_haplo.
Note that the 2009 matrix is applicable only to sets of four gametes.
set transition matrix null fraction X
This statement sets a parameter that modifies the transition matrix to allow for transitions that cannot occur under the base transition matrices. The argument, X, is a real number greater than or equal to 0.0 and less than or equal to 1.0.
set genotyping error rate E
This statement sets the genotyping error rate to be used by "ibd_haplo". The value of E is a real number greater than or equal to 0.0 and less than 1.0. The value 0.01 would be a typical value for E.
set population kinship X
This statement sets the prior population kinship parameter to be used by "ibd_haplo" to X, where X is a real number greater than 0.0 and less than 1.0. Typically in small populations a value from 0.01 to 0.05 might be reasonable.
set kinship change rate X
This statement sets the kinship change rate parameter for IBD. This is the total change rate per centiMorgan. It should be a real number greater than 0.0. It is approximately the prior for the inverse of an IBD segment length in centiMorgans between any pair of haplotypes. However, a value smaller than the inverse of the typical expected segment length generally works better.
set [scoreset N] proband gametes N1 K1 N2 K2 …
One or more scoring sets may be given, where a scoring set consists of
two or more haplotypes. If there is more than one set, each set is
assigned a number 1 or greater. The maximum number of haplotypes in each
set is limited to 10, due to computer memory considerations.
Pairs of names and meiosis indicators are given, with 0 indicating
maternal inheritance, 1 indicating paternal inheritance.
At least one proband gametes score set must be specified when running
ibd_haplo
.
This statement is also used by fgl2ibd.
set proband gametes all individual pairs
If this statement is used, then the ibd_haplo
program will set up
scoresets of
4 gametes for every pair of observed individuals. Typically the user will
have unphased genotypes for this purpose, although the data may be
specified as phased or as unphased.
This statement is also used by fgl2ibd.
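The effect of this statement can be pictured as generating one explicit four-gamete scoreset per pair of individuals, using the "set scoreset N proband gametes N1 K1 N2 K2 ..." syntax with 0 for maternal and 1 for paternal gametes. The sketch below is illustrative only: the function name `scoreset_statements` and the ordering of gametes within each set are assumptions, not the internal behavior of ibd_haplo.

```python
from itertools import combinations

def scoreset_statements(individuals):
    """Return one explicit four-gamete scoreset statement per pair of individuals."""
    lines = []
    for n, (a, b) in enumerate(combinations(individuals, 2), start=1):
        # Each individual contributes a maternal (0) and a paternal (1) gamete
        lines.append(f"set scoreset {n} proband gametes {a} 0 {a} 1 {b} 0 {b} 1")
    return lines
```

For n observed individuals this yields n(n-1)/2 scoresets, so the number of sets grows quadratically with the sample size.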
output four-gamete state order jacquard
In the case of four gametes, the user may select to output states in the traditional "Jacquard" order: 1111, 1122, 1112, 1121, 1123, 1211, 1222, 1233, 1212, 1221, 1213, 1231, 1223, 1232, 1234. If the gametes are of a pair of individuals, and the data are analyzed as unphased, the output state-probabilities will be reduced to nine, in the ordering: 1111, 1122, 1112+1121, 1123, 1211+1222, 1233, 1212+1221, 1213+1231+1223+1232, 1234.
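The reduction from fifteen phased states to nine unphased state classes is a sum over the groups listed above. A minimal sketch, with the names `JACQUARD_15`, `UNPHASED_9` and `collapse_to_unphased` introduced here only for illustration:

```python
# The fifteen four-gamete states in the Jacquard order given above
JACQUARD_15 = ["1111", "1122", "1112", "1121", "1123", "1211", "1222", "1233",
               "1212", "1221", "1213", "1231", "1223", "1232", "1234"]

# The nine unphased classes, each a group of phased states to be summed
UNPHASED_9 = [["1111"], ["1122"], ["1112", "1121"], ["1123"],
              ["1211", "1222"], ["1233"], ["1212", "1221"],
              ["1213", "1231", "1223", "1232"], ["1234"]]

def collapse_to_unphased(probs15):
    """Sum fifteen state probabilities (Jacquard order) into the nine unphased classes."""
    by_state = dict(zip(JACQUARD_15, probs15))
    return [sum(by_state[s] for s in group) for group in UNPHASED_9]
```

Because every phased state belongs to exactly one group, the nine collapsed values sum to the same total as the original fifteen.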
output scores at markers I1 I2 ...
The HMM algorithms of ibd_haplo
require that all markers be used
in the computation of the marker-based conditional probabilities of
ibd states. However, subsequent
analyses may not require computation at every marker location. Thus,
to limit output files, it is possible to request the state probabilities
to be output only at a subset of the markers.
This document was generated by Elizabeth Thompson on September 6, 2019 using texi2html 1.82.