Calculating credible intervals¶

When we combine datasets, by default, we assume that each one is from the same population and that there is no sampling variance. If these assumptions are violated it would lead to overly tight confidence intervals. In reality these assumptions are rarely met and so by default HLAfreq does not supply credible intervals.

However, if we model this variance we can obtain accurate credible intervals for the combined allele frequencies. We can do this using AFhdi() from the HLAfreq.HLAfreq_pymc module which we import as HLAhdi below.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import HLAfreq
from HLAfreq import HLAfreq_pymc as HLAhdi
WARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.
In [2]:
# Download dataset
country = "Mongolia"
locus = "DQB1"

base_url = HLAfreq.makeURL(country, locus=locus)
aftab = HLAfreq.getAFdata(base_url)
aftab = HLAfreq.only_complete(aftab)
HLAfreq.check_resolution(aftab)
aftab = HLAfreq.decrease_resolution(aftab, 2)
caf = HLAfreq.combineAF(aftab)
1 pages of results
Download complete
2    64
3     3
Name: allele, dtype: int64
Multiple resolutions in AFtab. Fix with decrease_resolution()
In [3]:
# Calculate high density intervals
hdi = HLAhdi.AFhdi(aftab, credible_interval=0.95)
# Add to the combined allele frequency dataset
caf = pd.merge(caf, hdi, how="left", on="allele")
caf[['allele', 'allele_freq', 'lo','hi','post_mean']]
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [frac, conc]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 7 seconds.
Out[3]:
allele allele_freq lo hi post_mean
0 DQB1*02:01 0.102965 0.070885 0.141614 0.107
1 DQB1*02:02 0.062476 0.026928 0.080378 0.052
2 DQB1*03:01 0.271881 0.216031 0.325208 0.274
3 DQB1*03:02 0.045532 0.023916 0.069979 0.044
4 DQB1*03:03 0.128528 0.065484 0.140824 0.102
5 DQB1*03:04 0.001752 0.000079 0.010115 0.004
6 DQB1*03:13 0.001752 0.000117 0.010007 0.004
7 DQB1*04:01 0.038527 0.016878 0.059678 0.038
8 DQB1*04:02 0.021780 0.012336 0.048523 0.031
9 DQB1*05:01 0.060791 0.035718 0.09095 0.062
10 DQB1*05:02 0.039710 0.020024 0.065927 0.042
11 DQB1*05:03 0.020403 0.00555 0.033167 0.018
12 DQB1*05:04 0.001170 0.000059 0.008448 0.004
13 DQB1*05:42 0.001170 0.000095 0.008946 0.004
14 DQB1*06:01 0.065409 0.03638 0.091338 0.063
15 DQB1*06:02 0.071234 0.03162 0.088836 0.061
16 DQB1*06:03 0.029325 0.018884 0.062921 0.04
17 DQB1*06:04 0.018674 0.010062 0.04377 0.026
18 DQB1*06:05 0.001165 0.00005 0.008732 0.004
19 DQB1*06:07 0.001743 0.000112 0.009819 0.004
20 DQB1*06:08 0.001165 0.000012 0.008881 0.004
21 DQB1*06:09 0.012848 0.001486 0.023908 0.012
In [4]:
HLAfreq.plotAF(caf, aftab, hdi=hdi)
No description has been provided for this image

Above, we have calculated the 95% high density credible interval and added it to the combined allele frequency dataset. Note that by setting the credible_interval argument you can calculate any credible interval.

This has also added a column post_mean which is the mean of the posterior distribution for allele frequency estimated by the variance model. In general this is very similar to the allele_freq estimated by default. The difference is that it takes account of the variances mentioned before. However, in some cases the difference can become large and a warning is produced. Below we discuss when this can happen and what to do about it.

Explaining differences between allele_freq and post_mean¶

allele_freq and post_mean are two different ways of estimating the average allele frequency for a set of studies. Because they are estimating the same thing they are very similar in most cases. However, in some situations they can diverge, in this section we discuss why that happens and what to do about it.

In no particular order the main causes are:

  • Inappropriate prior
  • Heterogeneity in estimated frequency for a single allele
  • Issues with model convergence

Inappropriate prior¶

Sometimes the post_mean and credible intervals are clearly too low. We have seen this when there are many alleles (usually >100) compared to the amount of data and many of these alleles are very rare. This is covered in more detail in the working with priors example. Fortunately the fix is quite simple; instead of the default prior (a vector of $1$ s) use a vector of $1/k$ s where $k$ is the number of alleles. In simulations this resolved this issue even in extreme cases. Alternatively the default estimate of allele_freq is far more robust to this issue.

In [ ]:
# When k is the number of alleles
prior=[1/k]*k

Heterogeneity in estimated frequency for a single allele¶

Below we calculate and plot the allele frequencies for DQB1 in Italy. DQB1*02:01 at the bottom of the plot shows that one study reports a much higher frequency than the others. This draws the allele_freq estimate (shown as a black open point) up and away from the other studies. This is analagous to an outlier skewing a mean. Whereas the credible intervals (the black horizontal line) and post_mean estimates remain centered on the majority of studies.

In [6]:
base_url = HLAfreq.makeURL("Italy", locus="DQB1")
aftab = HLAfreq.getAFdata(base_url)
aftab = HLAfreq.only_complete(aftab)
HLAfreq.check_resolution(aftab)
aftab = HLAfreq.decrease_resolution(aftab, 2)
caf = HLAfreq.combineAF(aftab)
hdi = HLAhdi.AFhdi(aftab,credible_interval=0.95)
HLAfreq.plotAF(caf, aftab, hdi=hdi)
3 pages of results
Download complete
population   loci
Italy pop 4  DQB1    0.574
Name: allele_freq, dtype: float64
1 studies have total allele frequency < 0.95
2    206
3      1
Name: allele, dtype: int64
Multiple resolutions in AFtab. Fix with decrease_resolution()
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [frac, conc]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 11 seconds.
WARNING: The default allele frequency estimate is outside of the CI estimated by the compound method for some alleles!
There are several possible reasons, see the credible intervals example: https://github.com/BarinthusBio/HLAfreq/blob/main/examples/credible_intervals.ipynb
If you have set `credible_interval` to < 0.95, this may be a non-issue.
No description has been provided for this image

In this case allele_freq and post_mean diverge because by modelling the sampling variance post_mean (and credible intervals) can better estimate the country allele frequency without being overly influenced by unusual studies.

What to do about this difference depends on your study. The post_mean estimate is probably the better estimate of average allele frequency for these datasets but it is important to check these datasets to make sure they are relevent to your study population. Below we look at each study estimate for DQB1*02:01.

In [7]:
aftab[aftab.allele=="DQB1*02:01"]
Out[7]:
allele loci population allele_freq sample_size
0 DQB1*02:01 DQB1 Germany DKMS - Italy minority 0.1898 1159
17 DQB1*02:01 DQB1 Italy Bergamo 0.0910 101
54 DQB1*02:01 DQB1 Italy Central 0.1730 380
69 DQB1*02:01 DQB1 Italy North pop 3 0.0730 97
102 DQB1*02:01 DQB1 Italy Rome 0.1200 100
144 DQB1*02:01 DQB1 Italy Sardinia pop2 0.5900 1129
159 DQB1*02:01 DQB1 Italy pop 2 0.1980 53
171 DQB1*02:01 DQB1 Italy pop 3 0.2300 179
185 DQB1*02:01 DQB1 Italy pop 5 0.0780 975

The high frequency estimate comes from the island population of Sardinia rather than mainland Italy. Depending on your study question you might choose to remove some studies and recalculate estimates. Note that you should remove the entire study, not just the alleles with different frequencies. Another study you might remove is Germany DKMS-Italy minority, as these are registered bone marrow donors in Germany who assigned their origin as Italy. Details about how data was collected are available on allelefrequencies.net if you look up the population.

Issues with model convergence¶

If the model fails to converge the post_mean and credible interval estimates should not be used. Fortunately this appears to be rare and we have not observed it. However, if it does happen you could either use the default allele_freq, or dive into pymc to try and fit a better model.

The following sections describe diagnosing convergence issues and details of the model itself.

Model details¶

This section describes the variance model in more detail as well as how to access the model and assess convergence.

The variance model used to estiamte high posterior density interval and post_mean is a Marginalized Dirichlet-Multinomial Model in PyMc as described here.

In brief, the global allele frequency is modelled as a Dirichlet distribution, and each population (defined by datasetID) is a Dirichlet distribution draw from the global Dirichlet distribution, and the observed allele count data of that population is multinomial count data drawn from the population Dirichlet distribution.

The observed allele frequencies are transformed into allele counts using weights. The variability of population allele frequencies around the global mean is defined by a latent, lognormal variable conc.

Fitting the model¶

If you suspect that the model is failing to converge you can fit the model and check it directly. The code to fit the model below is the same that AFhdi() uses internally.

In [9]:
# Format data and fit variance model
c_array, allele_names = HLAhdi._make_c_array(aftab)
idata = HLAhdi._fit_Dirichlet_Multinomial(c_array)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [frac, conc]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 12 seconds.

Validating the model¶

Once you have fit the model you can validate it as you would any other PyMC model, in particular I recommend posterior predictive checks which can show that your estimates are being overly influenced by the prior as can happen with small datasets and large numbers of alleles. A basic overview is available here and a more detailed check for biased parameters is described here.

Some basic checks are shown below. It's a good idea to check the r_hat values are close to 1. Higher r_hat values indicate that some chains have failed to converge and pymc will also give warnings for high values.

In [10]:
import arviz as az
az.summary(idata)
Out[10]:
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
frac[0] 0.144 0.025 0.095 0.189 0.000 0.000 5064.0 2570.0 1.0
frac[1] 0.020 0.008 0.007 0.034 0.000 0.000 4900.0 2553.0 1.0
frac[2] 0.003 0.002 0.000 0.008 0.000 0.000 4228.0 2307.0 1.0
frac[3] 0.209 0.030 0.152 0.264 0.000 0.000 4433.0 2760.0 1.0
frac[4] 0.072 0.017 0.042 0.104 0.000 0.000 5285.0 2477.0 1.0
frac[5] 0.042 0.011 0.021 0.063 0.000 0.000 4996.0 2143.0 1.0
frac[6] 0.015 0.006 0.005 0.025 0.000 0.000 4635.0 2491.0 1.0
frac[7] 0.008 0.004 0.002 0.015 0.000 0.000 4582.0 2966.0 1.0
frac[8] 0.002 0.002 0.000 0.005 0.000 0.000 2501.0 1628.0 1.0
frac[9] 0.002 0.002 0.000 0.004 0.000 0.000 2900.0 1691.0 1.0
frac[10] 0.002 0.002 0.000 0.005 0.000 0.000 2636.0 1445.0 1.0
frac[11] 0.002 0.002 0.000 0.005 0.000 0.000 3434.0 1832.0 1.0
frac[12] 0.002 0.002 0.000 0.004 0.000 0.000 2833.0 1737.0 1.0
frac[13] 0.003 0.002 0.000 0.007 0.000 0.000 4053.0 2244.0 1.0
frac[14] 0.002 0.002 0.000 0.004 0.000 0.000 2968.0 1520.0 1.0
frac[15] 0.002 0.002 0.000 0.005 0.000 0.000 2586.0 1530.0 1.0
frac[16] 0.003 0.002 0.000 0.008 0.000 0.000 3245.0 1888.0 1.0
frac[17] 0.005 0.003 0.001 0.011 0.000 0.000 3413.0 2071.0 1.0
frac[18] 0.040 0.011 0.021 0.062 0.000 0.000 4816.0 2444.0 1.0
frac[19] 0.003 0.002 0.000 0.007 0.000 0.000 3833.0 1695.0 1.0
frac[20] 0.089 0.019 0.055 0.123 0.000 0.000 5406.0 2574.0 1.0
frac[21] 0.070 0.016 0.042 0.101 0.000 0.000 5408.0 2315.0 1.0
frac[22] 0.052 0.013 0.029 0.077 0.000 0.000 5539.0 2855.0 1.0
frac[23] 0.005 0.003 0.001 0.010 0.000 0.000 3347.0 1889.0 1.0
frac[24] 0.027 0.008 0.012 0.043 0.000 0.000 3708.0 2691.0 1.0
frac[25] 0.048 0.012 0.026 0.071 0.000 0.000 5204.0 2544.0 1.0
frac[26] 0.049 0.013 0.026 0.072 0.000 0.000 4751.0 2326.0 1.0
frac[27] 0.044 0.012 0.022 0.066 0.000 0.000 4116.0 2295.0 1.0
frac[28] 0.003 0.002 0.000 0.008 0.000 0.000 3708.0 2002.0 1.0
frac[29] 0.002 0.002 0.000 0.004 0.000 0.000 2600.0 1636.0 1.0
frac[30] 0.002 0.002 0.000 0.004 0.000 0.000 2956.0 1951.0 1.0
frac[31] 0.002 0.002 0.000 0.004 0.000 0.000 3457.0 2407.0 1.0
frac[32] 0.012 0.005 0.004 0.021 0.000 0.000 3938.0 2234.0 1.0
frac[33] 0.002 0.002 0.000 0.005 0.000 0.000 2976.0 1941.0 1.0
frac[34] 0.002 0.002 0.000 0.005 0.000 0.000 3412.0 1919.0 1.0
frac[35] 0.002 0.002 0.000 0.004 0.000 0.000 2827.0 1542.0 1.0
frac[36] 0.002 0.002 0.000 0.004 0.000 0.000 3072.0 1448.0 1.0
frac[37] 0.002 0.002 0.000 0.004 0.000 0.000 2261.0 1217.0 1.0
frac[38] 0.002 0.002 0.000 0.004 0.000 0.000 2664.0 1623.0 1.0
frac[39] 0.002 0.002 0.000 0.005 0.000 0.000 3466.0 2163.0 1.0
frac[40] 0.002 0.002 0.000 0.005 0.000 0.000 3012.0 1648.0 1.0
frac[41] 0.002 0.002 0.000 0.004 0.000 0.000 3740.0 2475.0 1.0
frac[42] 0.002 0.002 0.000 0.005 0.000 0.000 2703.0 1464.0 1.0
frac[43] 0.002 0.002 0.000 0.005 0.000 0.000 3380.0 2648.0 1.0
conc 19.339 3.001 13.982 25.118 0.054 0.038 3001.0 2386.0 1.0

An energy plot can check that the sampler performed well, if it did the two distributions should be similar.

In [11]:
az.plot_energy(idata)
Out[11]:
<AxesSubplot: >
No description has been provided for this image

Trace plots can also be used to check convergence and mixing. In the left hand plots we want the different chains to have converged to the same distributions. And in the right hand plot we want to see no trends or evidence of chains getting stuck.

With large numbers of alleles you may have to increase the max_subplots to see all alleles and avoid warnings.

In [12]:
az.plot_trace(idata, compact=False, combined=False)
/home/dwells/miniconda3/envs/hlafreq/lib/python3.10/site-packages/arviz/plots/traceplot.py:217: UserWarning: rcParams['plot.max_subplots'] (20) is smaller than the number of variables to plot (45), generating only 20 plots
  warnings.warn(
Out[12]:
array([[<AxesSubplot: title={'center': 'frac\n0'}>,
        <AxesSubplot: title={'center': 'frac\n0'}>],
       [<AxesSubplot: title={'center': 'frac\n1'}>,
        <AxesSubplot: title={'center': 'frac\n1'}>],
       [<AxesSubplot: title={'center': 'frac\n2'}>,
        <AxesSubplot: title={'center': 'frac\n2'}>],
       [<AxesSubplot: title={'center': 'frac\n3'}>,
        <AxesSubplot: title={'center': 'frac\n3'}>],
       [<AxesSubplot: title={'center': 'frac\n4'}>,
        <AxesSubplot: title={'center': 'frac\n4'}>],
       [<AxesSubplot: title={'center': 'frac\n5'}>,
        <AxesSubplot: title={'center': 'frac\n5'}>],
       [<AxesSubplot: title={'center': 'frac\n6'}>,
        <AxesSubplot: title={'center': 'frac\n6'}>],
       [<AxesSubplot: title={'center': 'frac\n7'}>,
        <AxesSubplot: title={'center': 'frac\n7'}>],
       [<AxesSubplot: title={'center': 'frac\n8'}>,
        <AxesSubplot: title={'center': 'frac\n8'}>],
       [<AxesSubplot: title={'center': 'frac\n9'}>,
        <AxesSubplot: title={'center': 'frac\n9'}>],
       [<AxesSubplot: title={'center': 'frac\n10'}>,
        <AxesSubplot: title={'center': 'frac\n10'}>],
       [<AxesSubplot: title={'center': 'frac\n11'}>,
        <AxesSubplot: title={'center': 'frac\n11'}>],
       [<AxesSubplot: title={'center': 'frac\n12'}>,
        <AxesSubplot: title={'center': 'frac\n12'}>],
       [<AxesSubplot: title={'center': 'frac\n13'}>,
        <AxesSubplot: title={'center': 'frac\n13'}>],
       [<AxesSubplot: title={'center': 'frac\n14'}>,
        <AxesSubplot: title={'center': 'frac\n14'}>],
       [<AxesSubplot: title={'center': 'frac\n15'}>,
        <AxesSubplot: title={'center': 'frac\n15'}>],
       [<AxesSubplot: title={'center': 'frac\n16'}>,
        <AxesSubplot: title={'center': 'frac\n16'}>],
       [<AxesSubplot: title={'center': 'frac\n17'}>,
        <AxesSubplot: title={'center': 'frac\n17'}>],
       [<AxesSubplot: title={'center': 'frac\n18'}>,
        <AxesSubplot: title={'center': 'frac\n18'}>],
       [<AxesSubplot: title={'center': 'frac\n19'}>,
        <AxesSubplot: title={'center': 'frac\n19'}>]], dtype=object)
No description has been provided for this image
In [ ]: