Conform-gt

Program: conform-gt.24May16.cee.jar
Author: Brian Browning
Email: browning@uw.edu

Introduction

The conform-gt program modifies a target Variant Call Format (VCF) so that its records are consistent with a reference VCF file.

The reference and target VCF files are each required to have data for at least 20 individuals so that the statistical tests performed by the conform-gt program give reasonable results.

The conform-gt program will:

  1. Find corresponding VCF records in the target and reference files.
  2. Exclude target VCF records which cannot be matched to any reference VCF record.
  3. Exclude target VCF records whose chromosome strand cannot be determined. The conform-gt program makes use of allele frequency and inter-marker correlation to determine chromosome strand in ambiguous cases.
  4. Adjust target VCF records so that chromosome strand and allele order match the VCF reference file.

VCF records are permitted to have multiple non-reference alleles.


Running conform-gt

To use conform-gt, enter the following command at the command line prompt:

java -jar conform-gt.jar arguments

where arguments is a space-separated list of parameters, each expressed as parameter=value.

Required Parameters

ref=file
specifies the VCF file with reference genotype (GT) data.
gt=file
specifies the VCF file with target genotype (GT) data.
chrom=chrom or chrom=chrom:start-end
specifies a chromosome or chromosome interval: chrom is the chromosome identifier, start is a start position, and end is an end position. The chromosome identifier must be identical in target and reference files. Either start or end may be omitted if it corresponds to a chromosome end. Variants that are outside the chromosome or chromosome interval will be ignored.
out=prefix
specifies the prefix for output filenames.

Optional Parameters

match=ID or POS
specifies the field used to match reference and target records. If match=ID, the ID field will be used. If match=POS, the POS field will be used (default: match=ID). If the reference and target VCF files use different genome coordinates (NCBI builds), you must use match=ID.
excludesamples=file
specifies a file containing samples to be excluded (one sample identifier per line) when using allele frequency and correlation evidence to determine whether a reference and target VCF record reflect the same chromosome strand. If the reference or target data contain samples with ancestry from more than one continent, you should use the excludesamples parameter to limit statistical analysis to samples from a single continental population.
strict=true or false
specifies the algorithm used to determine whether a reference and target VCF record reflect the same chromosome strand (default: strict=false). If strict=false, strand determination will be performed using only allele labels when that is possible, and minor allele frequency or correlation evidence will be used only if the allele labels are ambiguous. If strict=true, strand determination requires supporting minor allele frequency or correlation evidence.
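
As an illustration of combining required and optional parameters, the sketch below assembles a command that restricts the analysis to an interval of chromosome 20, matches records by position, and requires frequency or correlation support for strand determination. The file names, the interval, and the output prefix are hypothetical, and the jar name follows the generic command shown above. The command is printed rather than executed so that it can be reviewed first.

```shell
# Hypothetical input files and output prefix; substitute your own.
REF=ref.chr20.vcf.gz
GT=target.chr20.vcf.gz
OUT=conformed.chr20

# match=POS assumes that both files use the same genome coordinates.
CMD="java -jar conform-gt.jar ref=$REF gt=$GT chrom=20:1000000-2000000 match=POS strict=true out=$OUT"

# Print the command for review; run it with: sh -c "$CMD"
echo "$CMD"
```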


Algorithm

The conform-gt program first finds pairs of VCF records (one record each from the reference and target data) that correspond to the same variant. The match command line argument determines whether VCF records are matched using the ID field or the POS field. Two matched VCF records have an identical identifier (if match=ID) or an identical POS field (if match=POS). In either case, two records are matched only if the alleles in the target VCF record or in the strand-flipped target VCF record are a subset of the alleles in the reference record.

Once a target record is matched to a reference record, three sources of information are used to determine whether the reference and target VCF records are defined on the same chromosome strand or on opposite strands: allele labels, REF allele frequency, and inter-marker correlation of REF allele dosage.
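
This page does not describe the statistical test applied to the frequency evidence. Purely to illustrate the idea for a biallelic variant, the sketch below compares the target REF allele frequency with the reference REF allele frequency: on the same strand the two should be similar, while on opposite strands the target REF frequency should resemble one minus the reference REF frequency. The function name, the labels, and the tolerance of 0.1 are assumptions made for this example and are not taken from conform-gt.

```shell
# freq_strand TARGET_REF_FREQ REFERENCE_REF_FREQ [TOLERANCE]
# Hypothetical helper: classifies strand from REF allele frequencies alone.
freq_strand() {
  awk -v t="$1" -v r="$2" -v tol="${3:-0.1}" 'BEGIN {
    same = t - r;       if (same < 0) same = -same   # distance if same strand
    opp  = t - (1 - r); if (opp  < 0) opp  = -opp    # distance if opposite strand
    if (same <= tol && opp > tol)      print "SAME"
    else if (opp <= tol && same > tol) print "OPPOSITE"
    else                               print "UNKNOWN"
  }'
}

freq_strand 0.22 0.20   # prints SAME
freq_strand 0.79 0.20   # prints OPPOSITE
freq_strand 0.48 0.50   # prints UNKNOWN
```

The last call shows that frequency carries little information when the REF allele frequency is close to 0.5, which is one reason a second source of evidence such as inter-marker correlation is useful.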

If strict=true the orientation of the chromosome strand in a reference and a target VCF record (identical or opposite) is determined only if a) allele frequency or inter-marker allele correlation provide strong support for a particular orientation (identical or opposite), and b) the evidence from each source of information (allele labels, frequency, and correlation) is consistent.


Output files

Two output files are produced: an output VCF file and an output log file.

Each line of the output log file has 10 tab-delimited fields.

Fields  Name                      Description
1-5     CHROM, POS, ID, REF, ALT  First five fields of the input target VCF record
6       ALLELE                    Result when using allele labels to determine chromosome strand
7       FREQ                      Result when using REF allele frequency to determine chromosome strand
8       R2                        Result when using REF allele correlation to determine chromosome strand
9       SUMMARY                   "PASS", "FAIL", or "REMOVED". Variants marked "PASS" are present in the output VCF file.
10      INFO                      Chromosome strand relative to reference VCF, or reason for excluding the target variant

A target variant will be marked "REMOVED" in the SUMMARY field if it is not found in the reference VCF file, if there is more than one matching reference marker, if it is a duplicate variant in the target VCF file, or if it is out of order with respect to the reference VCF file. The corresponding INFO fields are "NOT_IN_REFERENCE", "MULTIPLE_REF_MATCHES", "DUPLICATE_MARKER", and "MARKER_OUT_OF_ORDER", respectively.

A target variant will be marked "FAIL" in the SUMMARY field if the chromosome strand evidence is inconsistent or inconclusive. The corresponding INFO fields are "INCONSISTENT_STRAND" and "UNKNOWN_STRAND".

A target variant will be marked "PASS" in the SUMMARY field if the variant is found in both the reference and target VCF files and if it can be determined whether the target variant alleles are defined on the same or opposite chromosome strand as the reference variant alleles. The corresponding INFO fields are "SAME_STRAND" and "OPPOSITE_STRAND".
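
Because SUMMARY and INFO are fields 9 and 10 of the tab-delimited log, the outcomes can be tallied with awk. The sketch below builds a made-up three-line log so that it is self-contained; the variant records and the "." placeholders in fields 6-8 are invented for illustration and do not reflect real conform-gt output. To use it on a real log, point the `log` variable at that file instead. If your log starts with a header line, that line will simply be counted as its own category.

```shell
# Made-up log for illustration only; fields 6-8 hold placeholder values.
log=$(mktemp)
printf '20\t61098\trs1\tC\tT\t.\t.\t.\tPASS\tSAME_STRAND\n'        >  "$log"
printf '20\t61795\trs2\tG\tT\t.\t.\t.\tPASS\tOPPOSITE_STRAND\n'    >> "$log"
printf '20\t63231\trs3\tT\tG\t.\t.\t.\tREMOVED\tNOT_IN_REFERENCE\n' >> "$log"

# Count records for each SUMMARY (field 9) and INFO (field 10) combination.
# Output columns: count, SUMMARY, INFO (tab-delimited).
summary=$(awk -F'\t' '{ n[$9 "\t" $10]++ }
                      END { for (k in n) print n[k] "\t" k }' "$log" | sort -k2)
echo "$summary"

rm -f "$log"
```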

If a target variant is out-of-order with respect to the reference VCF file and placed farther downstream in the target VCF file than it should be, the target variants between the correct and incorrect position will be reported to be out-of-order. In this case, exclude the misplaced record and re-run the conform-gt program.


Example

In this example, a target VCF file chr20.vcf.gz with European chromosome 20 data is modified so that it is consistent with the 1000 Genomes Project reference panel.


# Download sample information for 1000 Genomes Project
wget https://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/sample_info/integrated_call_samples_v3.20130502.ALL.panel

# Download 1000 Genomes Project reference panel for chromosome 20
wget https://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/b37.vcf/chr20.1kg.phase3.v5a.vcf.gz


# Create a file with a list of non-European samples
grep -v EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > non.eur.excl

# Run conform-gt program to make chr20.vcf.gz consistent with 1000 Genomes Project reference panel
java -jar conform-gt.24May16.cee.jar ref=chr20.1kg.phase3.v5a.vcf.gz gt=chr20.vcf.gz chrom=20 out=mod.chr20 excludesamples=non.eur.excl


Download conform-gt

The conform-gt program is licensed under the Apache License, Version 2.0 (the License). You may not use the conform-gt program except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.

Some source files in the net.sf.samtools package are licensed under the MIT License. See the source files for additional license information.

The conform-gt program is distributed on an "AS IS" BASIS WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

conform-gt.24May16.cee.jar      Java executable file
conform-gt.24May16.cee.src.zip  source code
human reference panel           1000 Genomes Project data

Copyright: 2013-2016 Brian L. Browning
Last updated: 03 Dec 2019