conform-gt program

Version: r1174
Email: browning@uw.edu


Introduction

The conform-gt program modifies a target Variant Call Format (VCF) file so that its records are consistent with a reference VCF file.

The target VCF file must contain data for at least 20 individuals so that the statistical tests performed by conform-gt give reasonable results.
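
The sample count of a target file can be checked before running the program. The following is a minimal sketch, not part of conform-gt. It assumes the standard VCF layout in which the #CHROM header line lists eight fixed columns and a FORMAT column before the sample identifiers. The function names count_samples and count_samples_in_file are hypothetical.

```python
import gzip

FIXED_COLUMNS = 9  # CHROM through INFO (8 columns) plus FORMAT precede the samples


def count_samples(lines):
    """Count the sample columns on the #CHROM header line of a VCF file."""
    for line in lines:
        if line.startswith("#CHROM"):
            columns = line.rstrip("\r\n").split("\t")
            return max(0, len(columns) - FIXED_COLUMNS)
    raise ValueError("no #CHROM header line found")


def count_samples_in_file(path):
    """Open a plain or gzip-compressed VCF file and count its samples."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_samples(handle)
```

A result below 20 from count_samples_in_file would indicate that the target file does not meet the requirement above.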

The conform-gt program will:

  1. Find corresponding VCF records in the target and reference files.
  2. Exclude target VCF records that do not correspond to any reference VCF record.
  3. Exclude target VCF records whose chromosome strand cannot be determined. The conform-gt program makes use of allele frequency and inter-marker correlation to determine chromosome strand in ambiguous cases.
  4. Adjust target VCF records so that chromosome strand and allele order match the reference VCF file.

VCF records are permitted to have multiple non-reference alleles.


Running conform-gt

To use conform-gt, enter the following command at the command line prompt:

java -jar conform-gt.jar arguments

where arguments is a space-separated list of parameters, each expressed as parameter=value.

Required Parameters

ref=file
specifies the VCF file with reference genotype (GT) data.
gt=file
specifies the VCF file with target genotype (GT) data.
chrom=interval
specifies the chromosome interval in the format chrom or chrom:start-end, where chrom is the chromosome identifier in the VCF files, and start and end are the interval start and end positions. Either start or end may be omitted if it corresponds to a chromosome end. Variants in the reference or target VCF files that are outside the specified chromosome interval will be ignored.
out=prefix
specifies the prefix for output filenames.
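
The chrom=interval syntax described above can be illustrated with a short sketch. This is not the parser used by conform-gt. The function names are hypothetical, and two points are assumptions: interval ends are treated as inclusive, and the chromosome identifier is assumed to contain no colon.

```python
def parse_interval(text):
    """Parse 'chrom' or 'chrom:start-end' into (chrom, start, end).

    A missing start or end is returned as None, meaning the interval
    extends to the corresponding chromosome end.
    """
    chrom, colon, span = text.partition(":")
    if not chrom:
        raise ValueError("missing chromosome identifier: %r" % text)
    if not colon:
        return chrom, None, None
    start_text, dash, end_text = span.partition("-")
    if not dash:
        raise ValueError("expected chrom:start-end, got %r" % text)
    start = int(start_text) if start_text else None
    end = int(end_text) if end_text else None
    return chrom, start, end


def in_interval(chrom, pos, interval):
    """Return True if a variant at (chrom, pos) lies inside the interval."""
    target_chrom, start, end = interval
    if chrom != target_chrom:
        return False
    if start is not None and pos < start:
        return False
    if end is not None and pos > end:
        return False
    return True
```

For example, parse_interval("20:1000-") describes every position on chromosome 20 from 1000 to the end of the chromosome.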

Optional Parameters

match=ID or POS
specifies the field used to match reference and target records. If match=ID, the ID field will be used. If match=POS, the POS field will be used (default: match=ID). If the reference and target VCF files use different genome coordinates (NCBI builds), you must use match=ID.
excludesamples=file
specifies a file containing samples to be excluded from the analysis (one sample identifier per line).

If the reference or target data contains samples with ancestry from more than one continent, you should use the excludesamples parameter to limit the analysis to samples from one continental population.


Algorithm

The conform-gt program first finds pairs of VCF records (one record each from the reference and target data) that correspond to the same variant. The match command line argument determines whether VCF records are matched using the ID field or the POS field. Two matched VCF records have an identical identifier (if match=ID) or an identical POS field (if match=POS). In either case, two records are matched only if the alleles in the target VCF record or in the strand-flipped target VCF record are a subset of the alleles in the reference record.
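
The allele-subset rule can be sketched as follows. This is an illustration, not the conform-gt source code. The function names are hypothetical, strand flipping is assumed to mean taking the reverse complement of each allele, and alleles containing characters other than A, C, G, and T are assumed not to be flippable.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def flip_strand(allele):
    """Return the reverse complement of an allele, or None if it has non-ACGT characters."""
    try:
        return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))
    except KeyError:
        return None


def allele_orientations(target_alleles, ref_alleles):
    """Return the orientations under which the target alleles are a subset of the reference alleles."""
    ref = {allele.upper() for allele in ref_alleles}
    orientations = set()
    if {allele.upper() for allele in target_alleles} <= ref:
        orientations.add("SAME_STRAND")
    flipped = {flip_strand(allele) for allele in target_alleles}
    if None not in flipped and flipped <= ref:
        orientations.add("OPPOSITE_STRAND")
    return orientations
```

An empty result means the two records cannot be matched. A result containing both orientations, as for an A/T or C/G variant, means the alleles alone cannot determine the strand.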

Once a target record is matched to a reference record, three sources of information are used to determine whether the target and reference chromosome strands are identical or opposite: alleles, REF allele frequency, and inter-marker correlation of REF allele dosage.

The target record chromosome strand is determined only if the allele frequency or the inter-marker correlation provides strong support for a particular orientation (identical or opposite) and if no two sources of information provide strong support for inconsistent orientations (e.g. one supporting identical and one supporting opposite orientation).
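
The decision rule in the preceding paragraph can be written out as a small sketch. This is one literal reading of the text, not the conform-gt implementation. The function name and the input encoding ("SAME", "OPPOSITE", or None for no strong support) are hypothetical, and the mapping of outcomes to the INFO values listed in the Output files section is an assumption.

```python
def decide_strand(allele_test, frequency_test, correlation_test):
    """Combine three strand tests into a single outcome.

    Each argument is "SAME", "OPPOSITE", or None (no strong support).
    """
    supported = {allele_test, frequency_test, correlation_test} - {None}
    if len(supported) > 1:
        # Two sources strongly support different orientations.
        return "INCONSISTENT_STRAND"
    if frequency_test is None and correlation_test is None:
        # Neither the frequency nor the correlation test gives strong support.
        return "UNKNOWN_STRAND"
    orientation = supported.pop()
    return "SAME_STRAND" if orientation == "SAME" else "OPPOSITE_STRAND"
```

Under this reading, allele evidence alone is not sufficient: decide_strand("SAME", None, None) yields "UNKNOWN_STRAND".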


Output files

Two output files are produced: an output VCF file and an output log file.

Each line of the output log file has 10 tab-delimited fields.

Fields  Name                          Description
1-5     [CHROM, POS, ID, REF, ALT]    First five fields of input target VCF record
6       [ALLELE_TEST]                 Result when using alleles to determine chromosome strand
7       [FREQUENCY_TEST]              Result when using REF allele frequency to determine chromosome strand
8       [CORRELAT'N_TEST]             Result when using REF allele correlation to determine chromosome strand
9       [SUMMARY]                     "PASS", "FAIL", or "REMOVED". Variants marked "PASS" are present in the output VCF file.
10      [INFO]                        Chromosome strand relative to reference VCF, or reason for excluding the target variant

A target variant will be marked "REMOVED" in the SUMMARY field if it is not found in the reference VCF file, if it is a duplicate variant in the target VCF file, if it is out-of-order with respect to the reference VCF file, or if it is missing a genotype (GT) field. The corresponding INFO field values are "NOT_IN_REFERENCE", "DUPLICATE_VARIANT", "OUT_OF_ORDER", and "NO_GT_FIELD".

A target variant will be marked "FAIL" in the SUMMARY field if the chromosome strand evidence is inconsistent or inconclusive. The corresponding INFO field values are "INCONSISTENT_STRAND" and "UNKNOWN_STRAND".

A target variant will be marked "PASS" in the SUMMARY field if the variant is found in both the reference and target VCF files and if it can be determined whether the target variant alleles are defined on the same or opposite chromosome strand as the reference variant alleles. The corresponding INFO field values are "SAME_STRAND" and "OPPOSITE_STRAND".

If a target variant is out-of-order with respect to the reference VCF file and placed farther downstream in the target VCF file than it should be, the target variants between the correct and incorrect position will be reported to be out-of-order. In this case, exclude the misplaced record and re-run the conform-gt program.
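
A quick tally of the SUMMARY and INFO fields shows how many target variants were kept and why the others were dropped. The following is a minimal sketch based only on the log layout described above (10 tab-delimited fields, SUMMARY in field 9, INFO in field 10). The function name is hypothetical, and lines whose ninth field is not "PASS", "FAIL", or "REMOVED" are skipped on the assumption that they are header lines.

```python
from collections import Counter

SUMMARY_VALUES = {"PASS", "FAIL", "REMOVED"}


def summarize_log(lines):
    """Count log records by their (SUMMARY, INFO) pair."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != 10 or fields[8] not in SUMMARY_VALUES:
            continue  # skip header lines and malformed records
        counts[(fields[8], fields[9])] += 1
    return counts
```

The function accepts any iterable of lines, so an open file handle for the log file can be passed directly. A large ("REMOVED", "OUT_OF_ORDER") count may point to a single misplaced record, as described in the preceding paragraph.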


Example

In this example, a target VCF file chr20.vcf.gz with European chromosome 20 data is modified so that it is consistent with the 1000 Genomes Project reference panel.

$ # Download sample information for 1000 Genomes Project
$ wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/phase1_integrated_calls.20101123.ALL.panel

$ # Download 1000 Genomes Project reference panel for chromosome 20
$ wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/chr20.1kg.ref.phase1_release_v3.20101123.vcf.gz

$ # Create a file with a list of non-European samples
$ grep -v EUR phase1_integrated_calls.20101123.ALL.panel | cut -f1 > non.eur.excl

$ # Run conform-gt program to make chr20.vcf.gz consistent with 1000 Genomes Project reference panel
$ java -jar conform-gt.jar ref=chr20.1kg.ref.phase1_release_v3.20101123.vcf.gz gt=chr20.vcf.gz chrom=20 out=mod.chr20 excludesamples=non.eur.excl


Download conform-gt

The conform-gt program is licensed under the Apache License, Version 2.0 (the License). You may not use the conform-gt program except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Some source files in the net.sf.samtools package are licensed under the MIT License. See the source files for additional license information.

The conform-gt program is distributed on an "AS IS" BASIS WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

conform-gt.r1174.jar java executable file
conform-gt.r1174.zip source code
human reference panel 1000 Genomes Project data

Copyright: 2013 Brian L. Browning
Last updated: 18 Oct 2013.