The SEQERR program estimates the rate at which homozygous major allele genotypes are mis-called as heterozygote genotypes at low frequency markers. This error rate is estimated using observed allele and genotype frequencies at low frequency variants in identity by descent segments. Identity by descent segments in sequence data can be detected using the IBDseq program.
If you publish genotype error rate estimates obtained from the SEQERR program, please cite the following reference:
B L Browning and S R Browning (2013) Detecting identity by descent and estimating genotype error rates in sequence data. The American Journal of Human Genetics 93(5):840-851. doi:10.1016/j.ajhg.2013.09.014
To use SEQERR, enter the following command at the command line prompt:
java -jar seqerr.jar arguments
where arguments is a space-separated list of parameters, each expressed as parameter=value.
chrom is the chromosome identifier in the VCF files, and start and end are the interval start and end positions. The end value may be omitted if it corresponds to a chromosome end.
The default value of 2 Mb (ibdlength=2e6) assumes no genetic map is specified and the data are from an outbred human population. If a map file with cM distances is used with data from an outbred human population, a minimum IBD segment length of 2 cM (ibdlength=2) would be appropriate.
The ibdtrim parameter should be large enough so that over-estimation of IBD segment length is negligible. Length units are either base-pair distances or genetic distances, depending on whether a genetic map is specified with the map parameter. The default value of 500 kb (ibdtrim=5e5) assumes no genetic map is specified and the data are from an outbred human population. If a map file with cM distances is used with data from an outbred human population, a trim of 0.5 cM (ibdtrim=0.5) would be appropriate.
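The unit switch described above can be sketched as a small shell helper. This is only an illustration: the function name seqerr_args and all file names are hypothetical, and the values are the ones quoted in this documentation for outbred human data (2e6 and 5e5 in base pairs without a map; 2 and 0.5 in cM with a map).

```shell
# Hypothetical helper (not part of SEQERR): print parameter=value arguments
# with ibdlength/ibdtrim in the units that match the presence of a genetic map.
#   $1 = VCF file, $2 = IBD file, $3 = output prefix, $4 = optional map file (cM)
seqerr_args() {
  if [ -n "${4:-}" ]; then
    # A genetic map is given, so lengths are genetic distances in cM
    echo "gt=$1 ibd=$2 out=$3 map=$4 ibdlength=2 ibdtrim=0.5"
  else
    # No genetic map, so lengths are base-pair distances
    echo "gt=$1 ibd=$2 out=$3 ibdlength=2e6 ibdtrim=5e5"
  fi
}

# Example use (not run here; relies on word splitting, so file names must not contain spaces):
#   java -jar seqerr.jar $(seqerr_args data.vcf.gz data.ibd run1 genetic.map)
```

Printing the arguments rather than running the program makes it easy to check the units before starting a long analysis.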
If the reference or target data contains samples with ancestry from more than one population, you should use the excludesamples parameter to limit the analysis to samples from a single population.
Two output files are produced:
an error file (.err) containing error rate estimates broken down by minor allele counts up to the minor allele count determined by the maxmaf parameter.
Four tab-delimited columns are printed to the error file: 1) the allele count, 2) the numerator of the estimator, 3) the denominator of the estimator, and 4) the error rate estimate.
If you split your data by chromosome, and analyze each chromosome separately, you can obtain genome-wide error rate estimates for each allele count by summing the numerator values (column 2) across chromosomes, summing the denominator values (column 3) across chromosomes, and then taking the ratio of these sums.
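A minimal sketch of this genome-wide calculation, using awk, is shown here. The function name combine_err and the file names are hypothetical. It assumes that each data line of an error file holds the four tab-delimited columns described above, and it skips any line whose first column is not an integer allele count (for example a header line, if one is present).

```shell
# Hypothetical helper (not part of SEQERR): combine per-chromosome .err files.
# For each allele count, sums the numerators (column 2) and denominators
# (column 3) across all input files, then prints the ratio of the two sums.
combine_err() {
  awk -F'\t' '
    $1 ~ /^[0-9]+$/ {            # keep only lines that start with an allele count
      num[$1] += $2
      den[$1] += $3
    }
    END {
      for (ac in num) {
        if (den[ac] > 0)
          printf "%d\t%.10g\t%.10g\t%.6g\n", ac, num[ac], den[ac], num[ac] / den[ac]
        else
          printf "%d\t%.10g\t%.10g\tNA\n", ac, num[ac], den[ac]
      }
    }
  ' "$@" | sort -n               # order the rows by allele count
}

# Example use (not run here):
#   combine_err chr1.err chr2.err chr3.err > genome.err
```

The ratio is taken after summing, so the result is not the same as averaging the per-chromosome error rates in column 4.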
In this example, the sequence error rate is estimated for simulated sequence data with random errors.
$ # Download example VCF and IBD files:
$ wget http://bochet.gcc.biostat.washington.edu/beagle/seqerr.test.vcf.gz
$ wget http://bochet.gcc.biostat.washington.edu/beagle/seqerr.test.ibd
$ # Download the seqerr program
$ wget http://bochet.gcc.biostat.washington.edu/beagle/seqerr.jar
$ # Run seqerr to estimate the error rate
$ java -jar seqerr.jar gt=seqerr.test.vcf.gz ibd=seqerr.test.ibd out=seqerr.test
The SEQERR program is licensed under the Apache License, Version 2.0 (the License). You may not use the SEQERR program except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Some source files in the
are licensed under the
MIT License. See the source files for additional
The SEQERR program is distributed on an "AS IS" BASIS WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
seqerr.r1181.jar: Java executable file
Copyright: 2013 Brian L. Browning
Last updated: 7 Nov 2013.