Program: | conform-gt.24May16.cee.jar |
Author: | Brian Browning |
Email: | browning@uw.edu |
The conform-gt program modifies a target Variant Call Format (VCF) so that its records are consistent with a reference VCF file.
The reference and target VCF files are each required to have data for at least 20 individuals so that the statistical tests performed by the conform-gt program give reasonable results.
The conform-gt program will:
VCF records are permitted to have multiple non-reference alleles.
To use conform-gt, enter the following command at the command line prompt:
java -jar conform-gt.jar arguments
where arguments is a space-separated list of parameters, each expressed as parameter=value.
chrom is the chromosome identifier, start is a start position, and end is an end position. The chromosome identifier must be identical in target and reference files. Either start or end may be omitted if it corresponds to a chromosome end. Variants that are outside the chromosome or chromosome interval will be ignored.
If match=ID, the ID field will be used. If match=POS, the POS field will be used (default: match=ID). If the reference and target VCF files use different genome coordinates (NCBI builds), you must use match=ID.
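A minimal sketch of why match=ID is needed when the two files use different genome builds. The records, positions, and the `match_key` helper are hypothetical and are not part of conform-gt; they only illustrate that an identifier can stay stable while the POS field changes between builds.

```python
def match_key(record, match="ID"):
    """Return the value used to pair a target record with a reference record."""
    if match == "ID":
        return record["ID"]
    if match == "POS":
        return record["POS"]
    raise ValueError(f"match must be 'ID' or 'POS', got {match!r}")


# Hypothetical records: the same variant described on two different genome builds.
reference = {"CHROM": "20", "POS": 61098, "ID": "rs_example"}
target = {"CHROM": "20", "POS": 80457, "ID": "rs_example"}

same_by_id = match_key(reference, "ID") == match_key(target, "ID")
same_by_pos = match_key(reference, "POS") == match_key(target, "POS")
print(same_by_id, same_by_pos)  # prints: True False
```

With these made-up records, matching on POS would miss the pair, while matching on ID finds it.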
Use the excludesamples parameter to limit statistical analysis to samples from one continental population.
(default: strict=false). If strict=false, strand determination will be performed using only allele labels if that is possible, and minor allele frequency or correlation evidence will be used only if allele labels are ambiguous. If strict=true, strand determination requires supporting minor allele frequency or correlation evidence.
The conform-gt program first finds pairs of VCF records (one record each from the reference and target data) that correspond to the same variant. The match command line argument determines whether VCF records are matched using the ID field or the POS field. Two matched VCF records have an identical identifier (if match=ID) or an identical POS field (if match=POS). In either case, two records are matched only if the alleles in the target VCF record or in the strand-flipped target VCF record are a subset of the alleles in the reference record.
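The allele subset condition can be sketched as follows. This is an illustration only, not the conform-gt source: the function name `allele_match` is hypothetical, and the sketch assumes single-nucleotide alleles, for which a strand flip is the base complement. Other allele types are treated here as not flippable.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def allele_match(target_alleles, reference_alleles):
    """Return (same, flipped): whether the target alleles, as given or after
    a strand flip, are a subset of the reference alleles."""
    reference = set(reference_alleles)
    same = set(target_alleles) <= reference
    # Assumption: only single-nucleotide alleles are flipped in this sketch.
    can_flip = all(a in COMPLEMENT for a in target_alleles)
    flipped = can_flip and {COMPLEMENT[a] for a in target_alleles} <= reference
    return same, flipped


print(allele_match(["A", "G"], ["A", "G"]))  # (True, False): same strand only
print(allele_match(["T", "C"], ["A", "G"]))  # (False, True): opposite strand only
print(allele_match(["A", "T"], ["A", "T"]))  # (True, True): labels are ambiguous
```

The last case shows why allele labels alone cannot always settle the strand: an A/T variant looks the same on both strands, so frequency or correlation evidence is needed.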
Once a target record is matched to the reference record, three sources of information are used to determine whether the chromosome strand in a reference and a target VCF record are identical or opposite: alleles, REF allele frequency, and inter-marker correlation of REF allele dosage.
If strict=true, the orientation of the chromosome strand in a reference and a target VCF record (identical or opposite) is determined only if a) allele frequency or inter-marker allele correlation provides strong support for a particular orientation (identical or opposite), and b) the evidence from each source of information (allele labels, frequency, and correlation) is consistent.
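The following is a simplified sketch of how the three sources of evidence might be combined under the two strict settings. It is an interpretation of the description on this page, not the program's actual logic: the function `decide_strand` and its inputs are hypothetical, each source is reduced to "SAME", "OPPOSITE", or None (no usable evidence), and the statistical tests that produce the frequency and correlation evidence are not modeled. In particular, the program may check consistency in non-strict mode in ways this sketch does not.

```python
def decide_strand(allele, freq, r2, strict=False):
    """Combine strand evidence from allele labels, REF allele frequency, and
    REF allele correlation. Each argument is "SAME", "OPPOSITE", or None."""
    statistical = [e for e in (freq, r2) if e is not None]
    if strict:
        # strict=true: frequency or correlation support is required,
        # and all informative sources must agree.
        if not statistical:
            return "UNKNOWN_STRAND"
        informative = [e for e in (allele, freq, r2) if e is not None]
        if len(set(informative)) > 1:
            return "INCONSISTENT_STRAND"
        return informative[0] + "_STRAND"
    # strict=false: allele labels decide when they are unambiguous.
    if allele is not None:
        return allele + "_STRAND"
    if not statistical:
        return "UNKNOWN_STRAND"
    if len(set(statistical)) > 1:
        return "INCONSISTENT_STRAND"
    return statistical[0] + "_STRAND"


print(decide_strand("SAME", None, None, strict=False))  # SAME_STRAND
print(decide_strand("SAME", None, None, strict=True))   # UNKNOWN_STRAND
```

The two calls show the practical difference: a variant whose strand is clear from allele labels alone passes with strict=false but lacks the required supporting evidence with strict=true.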
Two output files are produced:
Each line of the output log file has 10 tab-delimited fields.
Fields | Description |
---|---|
1-5 | CHROM, POS, ID, REF, ALT: first five fields of input target VCF record |
6 | ALLELE: result when using allele labels to determine chromosome strand |
7 | FREQ: result when using REF allele frequency to determine chromosome strand |
8 | R2: result when using REF allele correlation to determine chromosome strand |
9 | SUMMARY: "PASS", "FAIL", or "REMOVED". Variants marked "PASS" are present in the output VCF file. |
10 | INFO: chromosome strand relative to reference VCF, or reason for excluding the target variant |
A target variant will be marked "REMOVED" in the SUMMARY field if it is not found in the reference VCF file, if there is more than one matching reference marker, if it is a duplicate variant in the target VCF file, or if it is out-of-order with respect to the reference VCF file. The corresponding INFO fields are "NOT_IN_REFERENCE", "MULTIPLE_REF_MATCHES", "DUPLICATE_MARKER", and "MARKER_OUT_OF_ORDER".
A target variant will be marked "FAIL" in the SUMMARY field if the chromosome strand evidence is inconsistent or inconclusive. The corresponding INFO fields are "INCONSISTENT_STRAND" and "UNKNOWN_STRAND".
A target variant will be marked "PASS" in the SUMMARY field if the variant is found in both the reference and target VCF files and if it can be determined whether the target variant alleles are defined on the same or opposite chromosome strand as the reference variant alleles. The corresponding INFO fields are "SAME_STRAND" and "OPPOSITE_STRAND".
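A short sketch for summarizing a log file by SUMMARY and INFO value, which can help when reviewing why variants were dropped. The helper names are hypothetical. The sketch assumes 10 tab-delimited fields per line as described above, and skips a line whose first field is "CHROM" in case the log begins with a header line (an assumption, not confirmed here). The values shown in fields 6-8 of the sample lines are placeholders, not real program output.

```python
from collections import Counter

LOG_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT",
              "ALLELE", "FREQ", "R2", "SUMMARY", "INFO"]


def parse_log_line(line):
    """Split one tab-delimited log line into a dict keyed by field name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(LOG_FIELDS):
        raise ValueError(f"expected {len(LOG_FIELDS)} fields, got {len(values)}")
    return dict(zip(LOG_FIELDS, values))


def tally_log(lines):
    """Count log records by (SUMMARY, INFO), skipping blank and header lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("CHROM\t"):
            continue
        record = parse_log_line(line)
        counts[(record["SUMMARY"], record["INFO"])] += 1
    return counts


# Made-up sample lines; fields 6-8 hold placeholder text.
sample = [
    "20\t61098\trs_a\tC\tT\tNA\tNA\tNA\tPASS\tSAME_STRAND\n",
    "20\t61270\trs_b\tA\tC\tNA\tNA\tNA\tFAIL\tUNKNOWN_STRAND\n",
    "20\t61795\trs_c\tG\tT\tNA\tNA\tNA\tREMOVED\tNOT_IN_REFERENCE\n",
    "20\t62731\trs_d\tC\tA\tNA\tNA\tNA\tPASS\tSAME_STRAND\n",
]
for (summary, info), n in sorted(tally_log(sample).items()):
    print(summary, info, n)
```

To run it on a real log, pass an open file handle to `tally_log` instead of the sample list.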
If a target variant is out-of-order with respect to the reference VCF file and placed farther downstream in the target VCF file than it should be, the target variants between the correct and incorrect position will be reported to be out-of-order. In this case, exclude the misplaced record and re-run the conform-gt program.
In this example, a target VCF file chr20.vcf.gz with European chromosome 20 data is modified so that it is consistent with the 1000 Genomes Project reference panel.
# Download sample information for 1000 Genomes Project
wget https://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/sample_info/integrated_call_samples_v3.20130502.ALL.panel
# Download 1000 Genomes Project reference panel for chromosome 20
wget https://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/b37.vcf/chr20.1kg.phase3.v5a.vcf.gz
# Create a file with a list of non-European samples
grep -v EUR integrated_call_samples_v3.20130502.ALL.panel | cut -f1 > non.eur.excl
# Run conform-gt program to make chr20.vcf.gz consistent with 1000 Genomes Project reference panel
java -jar conform-gt.24May16.cee.jar ref=chr20.1kg.phase3.v5a.vcf.gz gt=chr20.vcf.gz chrom=20 out=mod.chr20 excludesamples=non.eur.excl
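The exclusion list in the example above is built with grep -v EUR, which keeps every line that does not contain the text "EUR" anywhere. A column-aware alternative is sketched below. It assumes the panel file is whitespace-delimited with a header line and the columns sample, pop, super_pop, and gender; that layout is an assumption about the downloaded file and should be checked before use. The function name and the sample rows are hypothetical.

```python
def samples_to_exclude(lines, keep_super_pop="EUR"):
    """Return sample IDs whose super-population differs from keep_super_pop.

    Assumes columns: sample, pop, super_pop, gender (header line optional).
    """
    excluded = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, the header line, and lines with too few columns.
        if len(fields) < 3 or fields[0] == "sample":
            continue
        if fields[2] != keep_super_pop:
            excluded.append(fields[0])
    return excluded


# Made-up rows in the assumed panel layout.
panel = [
    "sample\tpop\tsuper_pop\tgender\n",
    "S1\tGBR\tEUR\tmale\n",
    "S2\tCHB\tEAS\tfemale\n",
    "S3\tYRI\tAFR\tmale\n",
]
print(samples_to_exclude(panel))  # ['S2', 'S3']
```

Unlike the grep command, this sketch does not write the header word "sample" into the exclusion list. Write the returned IDs one per line to a file to use it with the excludesamples parameter.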
The conform-gt program is licensed under the Apache License, Version 2.0 (the License). You may not use the conform-gt program except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.
Some source files in the net.sf.samtools package are licensed under the MIT License. See the source files for additional license information.
The conform-gt program is distributed on an "AS IS" BASIS WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
conform-gt.24May16.cee.jar | java executable file |
conform-gt.24May16.cee.src.zip | source code |
human reference panel | 1000 Genomes Project data |
Copyright: 2013-2016 Brian L. Browning
Last updated: 03 Dec 2019