Parsing genotype arrays in gnomAD .vcf files without hail¶
There have been a few documented changes to gnomAD vcf. This excerpt describes one of the biggest changes, the genotype count arrays:
"We have removed the Het fields as they were hard to parse and instead provide genotype counts (GC), or a G-indexed array of counts of individuals per genotype. As mentioned in the Variant QC section, we have fundamentally switched from a site-level filtering strategy to an allele-specific filtering strategy."
While the order of the genotypes in these "G-indexed" arrays can be easily inferred for biallelic sites, it becomes more difficult to deduce at multiallelic sites. My co-worker, Julie, deserves credit for the inspiration for this post and special thanks for the back-and-forth conversation before she arrived at a working solution. We think using hail would alleviate the problem because it was the software used to work with the gnomAD data and it has an extensive framework for working with variants. The solution outlined below is a pure Python method to assign genotype labels to counts from a raw gnomAD .vcf.
To follow along, grab the chromosome 22 vcf and index file (or from the gnomAD downloads page)
# grab the gnomAD chromosome 22 vcf.
! wget https://storage.googleapis.com/gnomad-public/release-170228/coverage/genomes/gnomad.chr22.cov.txt.gz
! wget https://storage.googleapis.com/gnomad-public/release-170228/coverage/genomes/gnomad.chr22.cov.txt.gz.tbi
import cyvcf2
import pandas as pd # for visualization in the notebook.
import numpy as np
# cyvcf2 wraps htslib for efficient parsing:
gnomad22vcf = cyvcf2.VCF('gnomad.genomes.r2.0.1.sites.22.vcf.gz')
Here's a generator function that will return a variant from the gnomAD .vcf based on some filtering criteria. variant_generator
simply iterates through the gnomAD .vcf one record per iteration. The multiallelic
argument serves as a boolean flag because we'll be working with both biallelic and multiallelic variants below. The alt_af
parameter ensures that the returned variant will have some allele diversity. Rare variants serve as poor examples because some alternate alleles will have 0
, so there won't be any alleles to count. The variant_generator
function isn't intended to be sophisticated or useful beyond the scope of this blog post, it just pulls relevant variants for us to explore.
def variant_generator(vcf, multiallelic=False, alt_af=0.05):
"""
Iterate through .vcf file.
Skip FILTER variants
restrict to missense.
Return variants the exceed some alternate allele frequency .
"""
for _var in vcf:
if _var.FILTER:
# skip FILTER variants
continue
if 'missense' not in _var.INFO['CSQ']:
# do your best to skip non-exonic variants
continue
if multiallelic:
if (len(_var.ALT) > 1 and
any(0.4 > x > alt_af for x in _var.INFO['AF'])):
yield _var
else: # biallelic
if (len(_var.ALT) == 1 and
_var.INFO['AF'] > alt_af < 0.4):
yield _var
For posterity, we'll only look at one of the global sub-popuplations. Non-Finnish European, denoted by NFE
, is a good choice for this exercise because that group has a large sample size. The first example will start with a biallelic variant, as these have less complexity than multiallelic counterparts.
variant = next(variant_generator(gnomad22vcf))
variant
gnomAD bialleleic genotypes breakdown¶
In the gnomAD .vcf, the GC_NFE
field shows the aforementioned "G-indexed" array (from the gnomAD blog post). The order of "each genotype" is not immediately obvious, but we can infer the order by visiting the variant page (and/or by calculating below).
variant.INFO['GC_NFE']
These are genotype counts. In short, the genotype at a given loci is defined by the allele composition. For this particular variant the possible alleles are G
and A
and the possbible genotypes for those alleles are the combinations of the alleles. The Punnett square figure below provides a visual representation of the difference between a genotype and allele, think of A
and G
instead of B
and b
.
image from: http://www.wikihow.com/Use-a-Punnett-Square-to-Do-a-Monohybrid-Cross
One observation is that there are only 3 counts in the genotype count array, but there are four cells in the Punnett square. One of those counts represents both heterozygous genotypes, those two orange cells in the square. The question we're interested in is can we reproduce the alternate allele frequency from the genotype counts array?
# alternate allele count for NFE
variant.INFO['AC_NFE']
# same genotype counts array from above
variant.INFO['GC_NFE']
Take the second second element and add it with double the third element. The third element is doubled because
we're assuming it represents homozygous alternate allele state, or AA
.
3430 + (720*2)
So far it appears that the second element represents heterozygous variant counts and the third element represents homozygous variant counts. Let's check that the AN_NFE
matches twice the nubmer of all the genotypes in GC_NFE
.
# sum up the genotypes array from the variant record & multiply by 2
alleleN = sum(variant.INFO['GC_NFE']) * 2
print(alleleN)
Does the manual sum equal the gnomad .vcf AN_NFE
(Allele Number)?
variant.INFO['AN_NFE'] == alleleN
Now manually check the allele frequency.
(3430 + (720*2)) / alleleN
variant.INFO['AF_NFE']
Great, push ahead under the assumption that the genotype counts array is correct ~> (hom_ref, het_alt, hom_alt)
ref = variant.REF
alt = variant.ALT
gcounts_array = variant.INFO['GC_NFE']
print(gcounts_array)
alleles = alt
alleles.insert(0, ref)
print(alleles)
Set up a pandas DataFrame to visualize the genotypes and their counts -- this will resemble the Punnett square above.
df = pd.DataFrame(np.zeros((len(alleles),len(alleles))), index=alleles, columns=alleles)
n_alleles = len(alleles)
allele_order = np.tril_indices(n_alleles)
#https://docs.scipy.org/doc/numpy/reference/generated/numpy.tril_indices.html
df.values[allele_order] = variant.INFO['GC_NFE']
df
gnomAD multialleleic genotypes breakdown¶
The biallelic variant was fairly straight-forward. Where this array conversion really comes in handy is the multiallelic sites. Use the variant_generator
to grab the next variant, but change multiallelic=True
. Also, loop through the .vcf until we reach a variant with more than 2 alternate alleles, this will help show the benefit of deconstructing the added complexity that comes with multiallelic sites.
# will run for a minute or so
for variant in variant_generator(gnomad22vcf, multiallelic=True):
if len(variant.ALT) > 2:
break
variant
ref = variant.REF
alt = variant.ALT
gcounts_array = variant.INFO['GC_NFE']
print(gcounts_array)
That got confusing fast! The numpy method, np.tril_indices
, trick should help us obtain accurate genotypes.
# same code as above -- just applied to a new (multiallelic) variant
alleles = alt
alleles.insert(0, ref)
print(alleles)
df = pd.DataFrame(np.zeros((len(alleles),len(alleles))), index=alleles, columns=alleles)
n_alleles = len(alleles)
allele_order = np.tril_indices(n_alleles)
df.values[allele_order] = variant.INFO['GC_NFE']
There are 5,205
A
alleles in NFE
at this site according to the gnomAD variant page for G>A. Add up the cells with A
alleles, remembering to double the homozgyous cell.
df.style.apply(lambda x: [('background: lightgreen' if x.name == 'A' and _ > 0
else ('background: lightgreen' if i == 'A' and _ > 0 else ''))
for i,_ in x.iteritems()])
# manual calculation
3197 + (935)*2 + 136 + 2
This post can really be distilled down to assigning genotype labels to the genotype count array from the gnomAD vcf. Here's a bare-bones function to do that.
def gnomAD_Gindex(ref, alt):
"""
Assigns genotype labels to a gnomAD genotype count arrray.
"""
import numpy as np
alleles = alt
alleles.insert(0, ref)
genotype_labels = []
for a1,a2 in zip(*np.tril_indices(len(alleles))):
genotype_labels.append(alleles[a1] + alleles[a2])
return genotype_labels
gt_labels = gnomAD_Gindex(variant.REF, variant.ALT)
gt_labels
for gt, gtcount in zip(gt_labels, variant.INFO['GC_NFE']):
print('genotype {} has {} individuals'.format(gt, gtcount))
Comments are welcome. This was an exploratory exercise over a long weekend, so there could be elements I'm overlooking. Thanks for reading!
Comments
comments powered by Disqus