SEQERR

Introduction
Running SEQERR
Output files
Example
Download SEQERR

Introduction

The SEQERR program estimates the rate at which homozygous major-allele genotypes are miscalled as heterozygous genotypes at low-frequency markers. This error rate is estimated using observed allele and genotype frequencies at low-frequency variants in identity-by-descent (IBD) segments. IBD segments in sequence data can be detected using the IBDseq program.

If you publish genotype error rate estimates obtained from the SEQERR program, please cite the following reference:

B L Browning and S R Browning (2013) Detecting identity by descent and estimating genotype error rates in sequence data. The American Journal of Human Genetics 93(5):840-851. doi:10.1016/j.ajhg.2013.09.014

Running SEQERR

To use SEQERR, enter the following command at the command line prompt:

java -jar seqerr.jar arguments

where arguments is a space-separated list of parameters, each expressed as parameter=value.

Required Parameters

gt=file: specifies the VCF file with genotype data.
ibd=file: specifies the Beagle-format IBD file. The IBDseq program can be used to generate the IBD file.
out=prefix: specifies the prefix for output filenames.

Optional Parameters

chrom=interval: specifies the chromosome interval in the format chrom or chrom:start-end, where chrom is the chromosome identifier in the VCF file, and start and end are the interval start and end positions. The start or the end value may be omitted if it corresponds to a chromosome end.
excludesamples=file: specifies a file containing samples to be excluded from the analysis (one sample identifier per line).
map=file: specifies a PLINK-format genetic map file. HapMap GRCh36 and GRCh37 genetic maps in PLINK format are available for download.
ibdlength=non-negative number: specifies the minimum IBD segment length used in the analysis. The minimum IBD segment length should be sufficiently large so that the rate of false-positive IBD segments is negligible. Length units are either base-pair distances or genetic distances, depending on whether a genetic map is specified with the map parameter. The default value of 2 Mb (ibdlength=2e6) assumes no genetic map is specified and the data are from an outbred human population. If a map file with cM distances is used with data from an outbred human population, a minimum IBD segment length of 2 cM (ibdlength=2) would be appropriate.
ibdtrim=non-negative number: specifies the length to be trimmed from each end of each IBD segment used in the analysis. The ibdtrim parameter should be large enough so that over-estimation of IBD segment length is negligible. Length units are either base-pair distances or genetic distances, depending on whether a genetic map is specified with the map parameter. The default value of 500 kb (ibdtrim=5e5) assumes no genetic map is specified and the data are from an outbred human population. If a map file with cM distances is used with data from an outbred human population, a trim length of 0.5 cM (ibdtrim=0.5) would be appropriate.
maxmaf=positive number less than 0.5: specifies the maximum minor allele frequency for variants used in the analysis (default: 0.02).
maxmissing=non-negative number less than 0.5: specifies the maximum missing genotype rate for variants used in the analysis (default: 0.05).
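
Because the units of ibdlength and ibdtrim change when a genetic map is supplied, it can help to write the full command out once. The following is a minimal sketch, not part of the SEQERR distribution: the function name seqerr_cm_command and all file names are hypothetical placeholders, and the cM values are the ones suggested in the parameter descriptions for outbred human data. The function only prints the command so that it can be reviewed before it is run.

```shell
# seqerr_cm_command GT IBD MAP OUT
# Hypothetical helper: prints a SEQERR command line that uses genetic (cM)
# distances for the IBD length and trim thresholds.
seqerr_cm_command() {
  # ibdlength=2 and ibdtrim=0.5 are in cM because a map file is given.
  echo "java -jar seqerr.jar gt=$1 ibd=$2 out=$4 map=$3 ibdlength=2 ibdtrim=0.5"
}

# Placeholder file names; replace them with your own files.
seqerr_cm_command chr20.vcf.gz chr20.ibd chr20.map chr20.seqerr
```

Without the map parameter, the same thresholds would be given in base pairs (the defaults ibdlength=2e6 and ibdtrim=5e5).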

If the genotype data contain samples with ancestry from more than one population, you should use the excludesamples parameter to limit the analysis to samples from a single population.
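
One way to build the file for the excludesamples parameter is to start from a list of the samples you want to keep and exclude everything else in the VCF file. The sketch below assumes a gzip-compressed VCF file whose sample identifiers are the columns after FORMAT on the #CHROM header line, and a hypothetical keep-list file with one sample identifier per line. The function names vcf_samples and exclude_list are illustrative only.

```shell
# vcf_samples FILE.vcf.gz
# Prints one sample identifier per line, taken from the #CHROM header line
# (sample columns start at column 10 in a VCF file).
vcf_samples() {
  gzip -cd "$1" | awk -F'\t' '/^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }'
}

# exclude_list FILE.vcf.gz KEEP_FILE
# Prints the VCF samples that are NOT listed in KEEP_FILE
# (whole-line, fixed-string matching).
exclude_list() {
  vcf_samples "$1" | grep -vxF -f "$2"
}

# Example use (placeholder file names):
#   exclude_list data.vcf.gz keep.txt > exclude.txt
#   java -jar seqerr.jar gt=data.vcf.gz ibd=data.ibd out=data excludesamples=exclude.txt
```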

Output files

Two output files are produced:

a log file (.log) containing analysis parameters and a summary of the data.
an error file (.err) containing error rate estimates broken down by minor allele counts up to the minor allele count determined by the maxmaf parameter.

Four tab-delimited columns are printed to the error file: 1) the allele count, 2) the numerator of the estimator, 3) the denominator of the estimator, and 4) the error rate estimate.

If you split your data by chromosome, and analyze each chromosome separately, you can obtain genome-wide error rate estimates for each allele count by summing the numerator values (column 2) across chromosomes, summing the denominator values (column 3) across chromosomes, and then taking the ratio of these sums.
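
The summation described above can be sketched with awk. This is an illustrative helper, not part of SEQERR: the function name combine_err is hypothetical, and the sketch assumes that every data line of an error file starts with an integer allele count, so any line that does not (for example a header line, if one is present) is skipped.

```shell
# combine_err FILE.err [FILE.err ...]
# Sums the numerator (column 2) and denominator (column 3) for each allele
# count (column 1) across files, then prints four tab-delimited columns:
# allele count, summed numerator, summed denominator, ratio of the sums.
combine_err() {
  awk -F'\t' '
    $1 ~ /^[0-9]+$/ {            # data lines only (assumption: integer first field)
      num[$1] += $2
      den[$1] += $3
    }
    END {
      for (ac in num) {
        rate = (den[ac] > 0) ? num[ac] / den[ac] : "NA"
        printf "%d\t%s\t%s\t%s\n", ac, num[ac], den[ac], rate
      }
    }
  ' "$@" | sort -n
}

# Example use (placeholder file names):
#   combine_err chr1.err chr2.err > genome.err
```

Summing numerators and denominators before dividing weights each chromosome by the amount of data it contributes, which is not the same as averaging the per-chromosome error rates in column 4.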

Example

In this example, the sequence error rate is estimated for simulated sequence data with random errors.

$ # Download example VCF and IBD files:
$ wget https://bochet.gcc.biostat.washington.edu/beagle/seqerr.test.vcf.gz
$ wget https://bochet.gcc.biostat.washington.edu/beagle/seqerr.test.ibd

$ # Download the seqerr program
$ wget https://bochet.gcc.biostat.washington.edu/beagle/seqerr.jar

$ # Run seqerr to estimate the error rate
$ java -jar seqerr.jar gt=seqerr.test.vcf.gz ibd=seqerr.test.ibd out=seqerr.test

Download SEQERR

The SEQERR program is licensed under the Apache License, Version 2.0 (the License). You may not use the SEQERR program except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Some source files in the net.sf.samtools package are licensed under the MIT License. See the source files for additional license information.

The SEQERR program is distributed on an "AS IS" BASIS WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

seqerr.r1181.jar	java executable file
seqerr.r1181.zip	source code
