The conform-gt program modifies a target Variant Call Format (VCF) file so that its records are consistent with a reference VCF file.
The target VCF file must contain data for at least 20 individuals so that the statistical tests performed by conform-gt give reasonable results.
The conform-gt program will:
VCF records are permitted to have multiple non-reference alleles.
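The 20-individual requirement can be checked before running conform-gt. The following is a minimal sketch, assuming a plain or gzip-compressed VCF file whose `#CHROM` header line has the standard nine fixed columns followed by one column per sample. The helper names `count_samples` and `count_samples_in_file` are hypothetical and are not part of conform-gt.

```python
import gzip

# Standard fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
FIXED_VCF_COLUMNS = 9


def count_samples(header_line):
    """Return the number of sample columns in a VCF '#CHROM' header line."""
    fields = header_line.rstrip("\r\n").split("\t")
    if not fields or fields[0] != "#CHROM":
        raise ValueError("not a VCF column header line")
    # A VCF file without genotype data has only 8 columns, hence max(0, ...).
    return max(0, len(fields) - FIXED_VCF_COLUMNS)


def count_samples_in_file(path):
    """Scan a (possibly gzipped) VCF file and count its samples."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                return count_samples(line)
    raise ValueError("no '#CHROM' header line found")
```

For example, `count_samples_in_file("chr20.vcf.gz") >= 20` would indicate that the target file meets the stated minimum.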
To use conform-gt, enter the following command at the command line prompt:
java -jar conform-gt.jar arguments
where arguments is a space-separated list of parameters, each expressed as parameter=value.
chrom is the chromosome identifier in the VCF files, and start and end are the interval start and end positions. Either start or end may be omitted if it corresponds to a chromosome end. Variants in the reference or target VCF files that are outside the specified chromosome interval will be ignored.
If match=ID, the ID field will be used. If match=POS, the POS field will be used (default: match=ID). If the reference and target VCF files use different genome coordinates (NCBI builds), you must use match=ID.
If the reference or target data contains samples with ancestry from more than one continent, you should use the excludesamples parameter to limit the analysis to samples from one continental population.
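Because every argument has the form parameter=value, the command line can be assembled from a dictionary. The following is a minimal sketch; the helper name `build_conform_gt_command` is hypothetical, and the whitespace check is a conservative assumption based on the arguments being a space-separated list. The parameter values are the ones used in the example later in this document.

```python
def build_conform_gt_command(jar_path, params):
    """Return an argument list of the form: java -jar <jar> name=value ..."""
    command = ["java", "-jar", jar_path]
    for name, value in params.items():
        if value is None:
            continue  # skip parameters that were left unset
        text = str(value)
        # Assumption: values must not contain whitespace, since the
        # arguments are described as a space-separated list.
        if any(ch.isspace() for ch in text):
            raise ValueError("value for %r contains whitespace" % name)
        command.append("%s=%s" % (name, text))
    return command


params = {
    "ref": "chr20.1kg.ref.phase1_release_v3.20101123.vcf.gz",
    "gt": "chr20.vcf.gz",
    "chrom": "20",
    "out": "mod.chr20",
    "excludesamples": "non.eur.excl",
}
command = build_conform_gt_command("conform-gt.jar", params)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` to launch the program.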
The conform-gt program first finds pairs of VCF records (one record each from the reference and target data) that correspond to the same variant. The match command line argument determines whether VCF records are matched using the ID field or the POS field. Two matched VCF records have an identical identifier (match=ID) or an identical POS field (match=POS). In either case, two records are matched only if the alleles in the target VCF record or in the strand-flipped target VCF record are a subset of the alleles in the reference VCF record.
Once a target record is matched to the reference record, three sources of information are used to determine whether the target and reference chromosome strand are identical or opposite: alleles, REF allele frequency, and inter-marker correlation of REF allele dosage.
The target record chromosome strand is determined only if the allele frequency or the inter-marker correlation provides strong support for a particular orientation (identical or opposite) and if no two sources of information provide strong support for inconsistent orientations (e.g. one supporting identical and one supporting opposite orientation).
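The allele condition for matching and the rule for combining the three sources of information can be sketched as follows. This is an illustration of the logic as described in this section, not the conform-gt source code. It assumes that strand-flipping means reverse-complementing each allele, that alleles consist only of the letters A, C, G, T, and N, and that each source reports "SAME", "OPPOSITE", or `None` (no strong support). All function names are hypothetical. The returned labels are the SUMMARY and INFO values listed in the output section of this document.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def flip_allele(allele):
    """Reverse-complement an allele made of A, C, G, T, or N."""
    return "".join(COMPLEMENT[base] for base in reversed(allele.upper()))


def alleles_can_match(target_alleles, reference_alleles):
    """True if the target alleles, as given or strand-flipped,
    are a subset of the reference alleles."""
    reference = {a.upper() for a in reference_alleles}
    as_given = {a.upper() for a in target_alleles}
    flipped = {flip_allele(a) for a in target_alleles}
    return as_given <= reference or flipped <= reference


def combine_strand_evidence(alleles, frequency, correlation):
    """Combine three results, each "SAME", "OPPOSITE", or None.

    Returns a (SUMMARY, INFO) pair.
    """
    supported = {r for r in (alleles, frequency, correlation) if r is not None}
    if len(supported) > 1:
        # Two sources strongly support different orientations.
        return "FAIL", "INCONSISTENT_STRAND"
    if frequency is None and correlation is None:
        # Alleles alone are not treated as sufficient evidence.
        return "FAIL", "UNKNOWN_STRAND"
    return "PASS", supported.pop() + "_STRAND"
```

For instance, `combine_strand_evidence("SAME", None, None)` yields an unknown strand under this sketch, because neither the allele frequency nor the correlation gives strong support.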
Two output files are produced:
Each line of the output log file has 10 tab-delimited fields.
|1-5||[CHROM, POS, ID, REF, ALT]: first five fields of input target VCF record|
|6||Result when using alleles to determine chromosome strand|
|7||Result when using REF allele frequency to determine chromosome strand|
|8||Result when using REF allele correlation to determine chromosome strand|
|9||SUMMARY: "PASS", "FAIL", or "REMOVED". Variants marked "PASS" are present in the output VCF file.|
|10||INFO: chromosome strand relative to reference VCF, or reason for excluding the target variant|
A target variant will be marked "REMOVED" in the SUMMARY field if it is not found in the reference VCF file, if it is a duplicate variant in the target VCF file, if it is out-of-order with respect to the reference VCF file, or if it is missing a genotype (GT) field. The corresponding INFO fields are "NOT_IN_REFERENCE", "DUPLICATE_VARIANT", "OUT_OF_ORDER", and "NO_GT_FIELD".
A target variant will be marked "FAIL" in the SUMMARY field if the chromosome strand evidence is inconsistent or inconclusive. The corresponding INFO fields are "INCONSISTENT_STRAND" and "UNKNOWN_STRAND".
A target variant will be marked "PASS" in the SUMMARY field if the variant is found in both the reference and target VCF files and if it can be determined whether the target variant alleles are defined on the same or opposite chromosome strand as the reference variant alleles. The corresponding INFO fields are "SAME_STRAND" and "OPPOSITE_STRAND".
If a target variant is out-of-order with respect to the reference VCF file and placed farther downstream in the target VCF file than it should be, the target variants between the correct and incorrect position will be reported to be out-of-order. In this case, exclude the misplaced record and re-run the conform-gt program.
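A quick way to review a run is to tally the SUMMARY and INFO values in the log file. The following is a minimal sketch, assuming that SUMMARY and INFO are the ninth and tenth of the 10 tab-delimited fields and that any line with a different field count or an unrecognized SUMMARY value (such as a possible header line) can be skipped. The function name `tally_log` is hypothetical, and the sample lines are invented for illustration, with "." standing in for fields 6 to 8.

```python
from collections import Counter

EXPECTED_FIELDS = 10
SUMMARY_VALUES = ("PASS", "FAIL", "REMOVED")


def tally_log(lines):
    """Count (SUMMARY, INFO) pairs in conform-gt log lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) != EXPECTED_FIELDS:
            continue  # not a record line
        summary, info = fields[8], fields[9]
        if summary not in SUMMARY_VALUES:
            continue  # e.g. a header line, if the log has one
        counts[(summary, info)] += 1
    return counts


# Invented records for illustration only.
example_lines = [
    "20\t100\tvar1\tA\tG\t.\t.\t.\tPASS\tSAME_STRAND\n",
    "20\t200\tvar2\tC\tT\t.\t.\t.\tPASS\tOPPOSITE_STRAND\n",
    "20\t300\tvar3\tA\tT\t.\t.\t.\tFAIL\tUNKNOWN_STRAND\n",
    "20\t400\tvar4\tG\tA\t.\t.\t.\tREMOVED\tNOT_IN_REFERENCE\n",
    "20\t500\tvar5\tG\tC\t.\t.\t.\tPASS\tSAME_STRAND\n",
]
for (summary, info), n in sorted(tally_log(example_lines).items()):
    print(summary, info, n)
```

To use it on a real run, pass an open file handle for the log file in place of `example_lines`. A large "OUT_OF_ORDER" count may point to the single misplaced record described above.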
In this example, a target VCF file (chr20.vcf.gz) containing European chromosome 20 data is modified so that it is consistent with the 1000 Genomes Project reference panel.
$ # Download sample information for 1000 Genomes Project
$ wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/phase1_integrated_calls.20101123.ALL.panel
$ # Download 1000 Genomes Project reference panel for chromosome 20
$ wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/chr20.1kg.ref.phase1_release_v3.20101123.vcf.gz
$ # Create a file with a list of non-European samples
$ grep -v EUR phase1_integrated_calls.20101123.ALL.panel | cut -f1 > non.eur.excl
$ # Run conform-gt program to make chr20.vcf.gz consistent with 1000 Genomes Project reference panel
$ java -jar conform-gt.jar ref=chr20.1kg.ref.phase1_release_v3.20101123.vcf.gz gt=chr20.vcf.gz chrom=20 out=mod.chr20 excludesamples=non.eur.excl
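Where a script is preferred over the shell pipeline, the exclusion list can be built in Python. The following is a minimal sketch that mirrors `grep -v EUR ... | cut -f1`: it drops every line containing the text "EUR" and keeps the first tab-delimited field of the rest. The function name `non_european_samples` is hypothetical, and the panel lines in the usage example are invented.

```python
def non_european_samples(panel_lines):
    """Return the first tab-delimited field of each line lacking 'EUR'."""
    samples = []
    for line in panel_lines:
        line = line.rstrip("\r\n")
        if "EUR" in line:
            continue  # same substring test that grep -v EUR applies
        samples.append(line.split("\t")[0])
    return samples


# Invented panel lines for illustration only.
panel = [
    "SAMPLE_A\tPOP_1\tEUR\n",
    "SAMPLE_B\tPOP_2\tAFR\n",
    "SAMPLE_C\tPOP_3\tASN\n",
]
print("\n".join(non_european_samples(panel)))
```

Like the `grep` command, this is a substring test on the whole line, so it would also drop a line whose sample identifier happened to contain "EUR".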
The conform-gt program is licensed under the Apache License, Version 2.0 (the License). You may not use the conform-gt program except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Some source files in the
are licensed under the
MIT License. See the source files for additional
The conform-gt program is distributed on an "AS IS" BASIS WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
|conform-gt.r1174.jar||java executable file|
|human reference panel||1000 Genomes Project data|
Copyright: 2013 Brian L. Browning
Last updated: 18 Oct 2013.