The conform-gt program modifies a target Variant Call Format (VCF) so that its records are consistent with a reference VCF file.
The reference and target VCF files are each required to have data for at least 20 individuals so that the statistical tests performed by the conform-gt program give reasonable results.
The conform-gt program will:
VCF records are permitted to have multiple non-reference alleles.
To use conform-gt, enter the following command at the command line prompt:
java -jar conform-gt.jar arguments
where arguments is a space-separated list of parameters, each expressed as parameter=value.
chrom is the chromosome identifier,
start is a start position, and
end is an end position. The chromosome identifier must be identical in target and reference files. Either start or end may be omitted if it corresponds to a chromosome end. Variants that are outside the chromosome or chromosome interval will be ignored.
If match=ID, the ID field will be used. If match=POS, the POS field will be used (default: match=ID). If the reference and target VCF files use different genome coordinates (NCBI builds), you must use match=ID.
Use the excludesamples parameter to limit statistical analysis to samples from one continental population.
If strict=false, strand determination will be performed using only allele labels if that is possible, and minor allele frequency or correlation evidence will be used only if allele labels are ambiguous. If strict=true, strand determination requires supporting minor allele frequency or correlation evidence.
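As an illustration of the parameter=value convention described above, the following minimal Python sketch assembles a conform-gt command line from keyword arguments. The helper name build_command and the file names in the usage example are hypothetical; only the parameter names (ref, gt, chrom, out, match, strict, excludesamples) come from this page.

```python
def build_command(jar, **params):
    """Return an argument list: java -jar <jar> name=value ...

    Booleans are written in lower case (true/false), matching the
    strict=true / strict=false spelling used in this documentation.
    """
    args = ["java", "-jar", jar]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        args.append(f"{name}={value}")
    return args


# Hypothetical file names, for illustration only.
command = build_command(
    "conform-gt.jar",
    ref="reference.vcf.gz",
    gt="target.vcf.gz",
    chrom="20",
    out="mod.chr20",
    strict=False,
)
```

The resulting list can be joined with spaces for display, or passed to subprocess.run to launch the program.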
The conform-gt program first finds pairs of VCF records (one record each from the reference and target data) that correspond to the same variant. The match command line argument determines whether VCF records are matched using the ID field or the POS field. Two matched VCF records have an identical identifier (match=ID) or an identical POS field (match=POS). In either case, two records are matched only if the alleles in the target VCF record or in the strand-flipped target VCF record are a subset of the alleles in the reference VCF record.
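The allele condition for matching can be sketched in Python as follows. This is an illustrative sketch, not the conform-gt source code: the function names are hypothetical, only alleles made of the bases A, C, G, and T are handled, and taking the reverse complement of multi-base alleles as the "strand-flipped" allele is an assumption.

```python
# Map each base to its complement on the opposite strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def flip_strand(allele):
    """Return the reverse complement of an allele (A, C, G, T only)."""
    return allele.translate(COMPLEMENT)[::-1]


def alleles_compatible(target_alleles, ref_alleles):
    """True if the target alleles, as given or strand-flipped,
    are a subset of the reference alleles."""
    ref = set(ref_alleles)
    same_strand = set(target_alleles) <= ref
    opposite_strand = {flip_strand(a) for a in target_alleles} <= ref
    return same_strand or opposite_strand
```

For example, target alleles T and C are compatible with reference alleles A and G because the flipped alleles are A and G, whereas target alleles A and C are not compatible with reference alleles A and G on either strand.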
Once a target record is matched to the reference record, three sources of information are used to determine whether the chromosome strand in a reference and a target VCF record are identical or opposite: alleles, REF allele frequency, and inter-marker correlation of REF allele dosage.
If strict=true the orientation of the chromosome strand in a reference and a target VCF record (identical or opposite) is determined only if a) allele frequency or inter-marker allele correlation provide strong support for a particular orientation (identical or opposite), and b) the evidence from each source of information (allele labels, frequency, and correlation) is consistent.
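The way the strict parameter changes how evidence is combined can be sketched as below. This is a simplified illustration under stated assumptions, not the program's actual algorithm: each evidence source is assumed to have already been reduced to one of three calls, the function name combine_evidence is hypothetical, the thresholds that define "strong support" are not modeled, and the handling of conflicting frequency and correlation calls when strict=false is a guess. The result labels reuse the INFO values described in the output section.

```python
SAME = "SAME_STRAND"
OPPOSITE = "OPPOSITE_STRAND"
UNKNOWN = "UNKNOWN_STRAND"
INCONSISTENT = "INCONSISTENT_STRAND"


def combine_evidence(allele, freq, corr, strict):
    """Combine per-source strand calls (SAME, OPPOSITE, or UNKNOWN)."""
    if strict:
        # All informative sources must agree ...
        calls = {c for c in (allele, freq, corr) if c != UNKNOWN}
        if len(calls) > 1:
            return INCONSISTENT
        # ... and frequency or correlation must support the call.
        if freq == UNKNOWN and corr == UNKNOWN:
            return UNKNOWN
        return calls.pop()
    # strict=false: allele labels decide whenever they are unambiguous.
    if allele != UNKNOWN:
        return allele
    calls = {c for c in (freq, corr) if c != UNKNOWN}
    if len(calls) > 1:
        return INCONSISTENT  # assumption: conflicting evidence fails
    return calls.pop() if calls else UNKNOWN
```

Under these assumptions, a variant whose allele labels alone indicate the same strand is resolved when strict=false but remains unresolved when strict=true unless frequency or correlation evidence also supports that orientation.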
Two output files are produced:
Each line of the output log file has 10 tab-delimited fields.
|Field||Description|
|1-5||CHROM, POS, ID, REF, ALT: first five fields of input target VCF record|
|6||Result when using allele labels to determine chromosome strand|
|7||Result when using REF allele frequency to determine chromosome strand|
|8||Result when using REF allele correlation to determine chromosome strand|
|9||SUMMARY: "PASS", "FAIL", or "REMOVED". Variants marked "PASS" are present in the output VCF file.|
|10||INFO: chromosome strand relative to reference VCF, or reason for excluding the target variant|
A target variant will be marked "REMOVED" in the SUMMARY field if it is not found in the reference VCF file, if there is more than one matching reference marker, if it is a duplicate variant in the target VCF file, or if it is out-of-order with respect to the reference VCF file. The corresponding INFO field values are "NOT_IN_REFERENCE", "MULTIPLE_REF_MATCHES", "DUPLICATE_MARKER", and "MARKER_OUT_OF_ORDER".
A target variant will be marked "FAIL" in the SUMMARY field if the chromosome strand evidence is inconsistent or inconclusive. The corresponding INFO fields are "INCONSISTENT_STRAND" and "UNKNOWN_STRAND".
A target variant will be marked "PASS" in the SUMMARY field if the variant is found in both the reference and target VCF files and if it can be determined whether the target variant alleles are defined on the same or opposite chromosome strand as the reference variant alleles. The corresponding INFO fields are "SAME_STRAND" and "OPPOSITE_STRAND".
If a target variant is out-of-order with respect to the reference VCF file and placed farther downstream in the target VCF file than it should be, the target variants between the correct and incorrect position will be reported to be out-of-order. In this case, exclude the misplaced record and re-run the conform-gt program.
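To review how many target variants fall into each outcome, the log file can be tallied with a short script. The following Python sketch assumes that SUMMARY is the ninth and INFO the tenth tab-delimited field, as the field order above suggests; the function name summarize_log is hypothetical, and lines that do not have 10 fields or a recognized SUMMARY value (for example a header line, if one is present) are skipped.

```python
from collections import Counter

SUMMARY_VALUES = {"PASS", "FAIL", "REMOVED"}


def summarize_log(lines):
    """Count (SUMMARY, INFO) pairs in conform-gt log lines."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 10:
            continue
        summary, info = fields[8], fields[9]
        if summary in SUMMARY_VALUES:
            counts[(summary, info)] += 1
    return counts
```

The function accepts any iterable of lines, so an open file object for the log can be passed directly. A large count for ("REMOVED", "MARKER_OUT_OF_ORDER") may point to a misplaced record of the kind described above.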
In this example, a target VCF file of European chromosome 20 data (chr20.vcf.gz) is modified so that it is consistent with the 1000 Genomes Project reference panel.
$ # Download sample information for 1000 Genomes Project
$ wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/phase1_integrated_calls.20101123.ALL.panel
$ # Download 1000 Genomes Project reference panel for chromosome 20
$ wget http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase1_vcf/chr20.1kg.ref.phase1_release_v3.20101123.vcf.gz
$ # Create a file with a list of non-European samples
$ grep -v EUR phase1_integrated_calls.20101123.ALL.panel | cut -f1 > non.eur.excl
$ # Run conform-gt program to make chr20.vcf.gz consistent with 1000 Genomes Project reference panel
$ java -jar conform-gt.24May16.cee.jar ref=chr20.1kg.ref.phase1_release_v3.20101123.vcf.gz gt=chr20.vcf.gz chrom=20 out=mod.chr20 excludesamples=non.eur.excl
The conform-gt program is licensed under the Apache License, Version 2.0 (the License). You may not use the conform-gt program except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Some source files are licensed under the MIT License. See the source files for additional details.
The conform-gt program is distributed on an "AS IS" BASIS WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
|conform-gt.24May16.cee.jar||java executable file|
|human reference panel||1000 Genomes Project data|
Copyright: 2013-2016 Brian L. Browning
Last updated: 24 May 2016