Parsing genotype arrays in gnomAD .vcf files without hail¶

There have been a few documented changes to gnomAD vcf. This excerpt describes one of the biggest changes, the genotype count arrays:

"We have removed the Het fields as they were hard to parse and instead provide genotype counts (GC), or a G-indexed array of counts of individuals per genotype. As mentioned in the Variant QC section, we have fundamentally switched from a site-level filtering strategy to an allele-specific filtering strategy."

While the order of the genotypes in these "G-indexed" arrays can be easily inferred for biallelic sites, it becomes more difficult to deduce at multiallelic sites. My co-worker, Julie, deserves credit for the inspiration for this post and special thanks for the back-and-forth conversation before she arrived at a working solution. We think using hail would alleviate the problem because it was the software used to work with the gnomAD data and it has an extensive framework for working with variants. The solution outlined below is a pure Python method to assign genotype labels to counts from a raw gnomAD .vcf.

To follow along, grab the chromosome 22 vcf and index file (or from the gnomAD downloads page)

In [ ]:

# grab the gnomAD chromosome 22 vcf.
! wget https://storage.googleapis.com/gnomad-public/release-170228/coverage/genomes/gnomad.chr22.cov.txt.gz
! wget https://storage.googleapis.com/gnomad-public/release-170228/coverage/genomes/gnomad.chr22.cov.txt.gz.tbi

I'll be using Python to decode the genotype counts, in particular leaning on cyvcf2 for variant parsing and numpy for working with the genotype count data.

In [1]:

import cyvcf2
import pandas as pd # for visualization in the notebook.
import numpy as np

In [2]:

# cyvcf2 wraps htslib for efficient parsing:
gnomad22vcf = cyvcf2.VCF('gnomad.genomes.r2.0.1.sites.22.vcf.gz')

Here's a generator function that will return a variant from the gnomAD .vcf based on some filtering criteria. variant_generator simply iterates through the gnomAD .vcf one record per iteration. The multiallelic argument serves as a boolean flag because we'll be working with both biallelic and multiallelic variants below. The alt_af parameter ensures that the returned variant will have some allele diversity. Rare variants serve as poor examples because some alternate alleles will have 0, so there won't be any alleles to count. The variant_generator function isn't intended to be sophisticated or useful beyond the scope of this blog post, it just pulls relevant variants for us to explore.

In [3]:

def variant_generator(vcf, multiallelic=False, alt_af=0.05):
    """
    Iterate through .vcf file.
    Skip FILTER variants
    restrict to missense.
    Return variants the exceed some alternate allele frequency .
    """
    for _var in vcf:
        if _var.FILTER:
            # skip FILTER variants
            continue
        if 'missense' not in _var.INFO['CSQ']:
            # do your best to skip non-exonic variants
            continue
            
        if multiallelic:
            if (len(_var.ALT) > 1 and
                any(0.4 > x > alt_af for x in _var.INFO['AF'])):
                yield _var
        else: # biallelic
            if (len(_var.ALT) == 1 and
                _var.INFO['AF'] > alt_af < 0.4):
                yield _var

For posterity, we'll only look at one of the global sub-popuplations. Non-Finnish European, denoted by NFE, is a good choice for this exercise because that group has a large sample size. The first example will start with a biallelic variant, as these have less complexity than multiallelic counterparts.

In [4]:

variant = next(variant_generator(gnomad22vcf))

In [5]:

variant

Out[5]:

Variant(22:16449050 G/A)

gnomAD bialleleic genotypes breakdown¶

In the gnomAD .vcf, the GC_NFE field shows the aforementioned "G-indexed" array (from the gnomAD blog post). The order of "each genotype" is not immediately obvious, but we can infer the order by visiting the variant page (and/or by calculating below).

In [7]:

variant.INFO['GC_NFE']

Out[7]:

(3180, 3430, 720)

These are genotype counts. In short, the genotype at a given loci is defined by the allele composition. For this particular variant the possible alleles are G and A and the possbible genotypes for those alleles are the combinations of the alleles. The Punnett square figure below provides a visual representation of the difference between a genotype and allele, think of A and G instead of B and b.
image from: http://www.wikihow.com/Use-a-Punnett-Square-to-Do-a-Monohybrid-Cross

One observation is that there are only 3 counts in the genotype count array, but there are four cells in the Punnett square. One of those counts represents both heterozygous genotypes, those two orange cells in the square. The question we're interested in is can we reproduce the alternate allele frequency from the genotype counts array?

In [8]:

# alternate allele count for NFE
variant.INFO['AC_NFE']

Out[8]:

In [9]:

# same genotype counts array from above
variant.INFO['GC_NFE']

Out[9]:

(3180, 3430, 720)

Take the second second element and add it with double the third element. The third element is doubled because we're assuming it represents homozygous alternate allele state, or AA.

In [10]:

3430 + (720*2)

Out[10]:

So far it appears that the second element represents heterozygous variant counts and the third element represents homozygous variant counts. Let's check that the AN_NFE matches twice the nubmer of all the genotypes in GC_NFE.

In [11]:

# sum up the genotypes array from the variant record & multiply by 2
alleleN = sum(variant.INFO['GC_NFE']) * 2
print(alleleN)

Does the manual sum equal the gnomad .vcf AN_NFE (Allele Number)?

In [12]:

variant.INFO['AN_NFE'] == alleleN

Out[12]:

True

Now manually check the allele frequency.

In [13]:

(3430 + (720*2)) / alleleN

Out[13]:

0.3321964529331514

In [14]:

variant.INFO['AF_NFE']

Out[14]:

0.3321959972381592

Great, push ahead under the assumption that the genotype counts array is correct ~> (hom_ref, het_alt, hom_alt)

In [15]:

ref = variant.REF
alt = variant.ALT
gcounts_array = variant.INFO['GC_NFE']
print(gcounts_array)

(3180, 3430, 720)

In [16]:

alleles = alt
alleles.insert(0, ref)
print(alleles)

['G', 'A']

Set up a pandas DataFrame to visualize the genotypes and their counts -- this will resemble the Punnett square above.

In [17]:

df = pd.DataFrame(np.zeros((len(alleles),len(alleles))), index=alleles, columns=alleles)

In [18]:

n_alleles = len(alleles)
allele_order = np.tril_indices(n_alleles)
#https://docs.scipy.org/doc/numpy/reference/generated/numpy.tril_indices.html

In [19]:

df.values[allele_order] = variant.INFO['GC_NFE']
df

Out[19]:

	G	A
G	3180.0	0.0
A	3430.0	720.0

gnomAD multialleleic genotypes breakdown¶

The biallelic variant was fairly straight-forward. Where this array conversion really comes in handy is the multiallelic sites. Use the variant_generator to grab the next variant, but change multiallelic=True. Also, loop through the .vcf until we reach a variant with more than 2 alternate alleles, this will help show the benefit of deconstructing the added complexity that comes with multiallelic sites.

In [20]:

# will run for a minute or so
for variant in variant_generator(gnomad22vcf, multiallelic=True):
    if len(variant.ALT) > 2:
        break

In [21]:

variant

Out[21]:

Variant(22:46449891 G/A,C,T)

In [22]:

ref = variant.REF
alt = variant.ALT
gcounts_array = variant.INFO['GC_NFE']
print(gcounts_array)

(2871, 3197, 935, 252, 136, 2, 0, 2, 0, 0)

That got confusing fast! The numpy method, np.tril_indices, trick should help us obtain accurate genotypes.

In [23]:

# same code as above -- just applied to a new (multiallelic) variant
alleles = alt
alleles.insert(0, ref)
print(alleles)

df = pd.DataFrame(np.zeros((len(alleles),len(alleles))), index=alleles, columns=alleles)

n_alleles = len(alleles)
allele_order = np.tril_indices(n_alleles)

df.values[allele_order] = variant.INFO['GC_NFE']

['G', 'A', 'C', 'T']

There are 5,205 A alleles in NFE at this site according to the gnomAD variant page for G>A. Add up the cells with A alleles, remembering to double the homozgyous cell.

In [24]:

df.style.apply(lambda x: [('background: lightgreen' if x.name == 'A' and _ > 0
                            else ('background: lightgreen' if i == 'A' and _ > 0 else ''))
                            for i,_ in x.iteritems()])

Out[24]:

	G	A	C
G	2871	0	0
A	3197	935	0
C	252	136	2
T	0	2	0

In [25]:

# manual calculation
3197 + (935)*2 + 136 + 2

Out[25]:

That's accurate, to check other's you can use the three gnomAD variant pages to check that the allele counts match.
G>A
G>C
G>T

This post can really be distilled down to assigning genotype labels to the genotype count array from the gnomAD vcf. Here's a bare-bones function to do that.

In [26]:

def gnomAD_Gindex(ref, alt):
    """
    Assigns genotype labels to a gnomAD genotype count arrray.
    """
    import numpy as np
    alleles = alt
    alleles.insert(0, ref)
    
    genotype_labels = []
    for a1,a2 in zip(*np.tril_indices(len(alleles))):
        genotype_labels.append(alleles[a1] + alleles[a2])
        
    return genotype_labels

In [27]:

gt_labels = gnomAD_Gindex(variant.REF, variant.ALT)

In [28]:

gt_labels

Out[28]:

['GG', 'AG', 'AA', 'CG', 'CA', 'CC', 'TG', 'TA', 'TC', 'TT']

In [29]:

for gt, gtcount in zip(gt_labels, variant.INFO['GC_NFE']):
    print('genotype {} has {} individuals'.format(gt, gtcount))

genotype GG has 2871 individuals
genotype AG has 3197 individuals
genotype AA has 935 individuals
genotype CG has 252 individuals
genotype CA has 136 individuals
genotype CC has 2 individuals
genotype TG has 0 individuals
genotype TA has 2 individuals
genotype TC has 0 individuals
genotype TT has 0 individuals

Comments are welcome. This was an exploratory exercise over a long weekend, so there could be elements I'm overlooking. Thanks for reading!