The sequences of 150,119 genomes in the UK Biobank – Nature

Datasets

UKB data

The UKB phenotype and genotype data were collected following informed consent obtained from all participants. The North West Research Ethics Committee reviewed and approved the scientific protocol and operational procedures (REC reference number: 06/MRE08/65) of the UKB. Data for this study were obtained and research conducted under the UKB applications license numbers 24898, 52293, 68574 and 69804. Sequence data were processed as described in Supplementary Notes 1–4, Supplementary Figs. 5–8 and Supplementary Tables 16 and 17.

Phenotypes were downloaded from the UKB. A total of 8,180, 1,291 and 459 phenotypes were constructed for the XBI, XAF and XSA cohorts, respectively. The examples presented here were selected as noteworthy representative examples of association. The processing of phenotypes presented here, with reference to the field identity in the UKB data showcase, is provided in Supplementary Table 15.

Icelandic data

The gout sample set60, a total of 1,740 Icelandic individuals, was recruited through multiple sources. A subset of these individuals were regular users of anti-gout medication corresponding to the Anatomical Therapeutic Chemical Classification System class M04 (ATC-M04). Individuals using ATC-M04 were identified through questionnaires at the time of entry into genetics projects at deCODE and provided by the Directorate of Heahth from entry in the Prescription Medicines Register (2005-2020) or the Register of RAI Assessments and Minimum Data Set (MDS) for residents and applicants of nursing homes (1993–2018). Furthermore, about one-half had received a clinical diagnosis of gout (International Classification of Disease: ICD-9 code 274 or ICD-10 code M10) between 1984 and 2019 at Landspitali, the National University Hospital of Iceland, or at two rheumatology clinics, or such a diagnosis was determined by examining RAI and MDS medical records.

Serum levels of uric acid in blood samples from 95,086 Icelandic individuals were obtained from Landspitali, the National University Hospital of Iceland, and the Icelandic Medical Center (Laeknasetrid) Laboratory in Mjodd (RAM) between 1990 and 2020. Serum levels of uric acid were normalized to a standard normal distribution using quantile–quantile normalization and then adjusted for sex, year of birth and age at measurement. For individuals for whom more than one measurement was available, we used the average of the normalized value. Serum levels of uric acid were determined from an enzymatic reaction in which uricase oxidizes urate to allantoin and hydrogen peroxide, which, with the aid of peroxidase and a dye, forms a coloured complex that can be measured in a photometer at a wavelength of 670 nm.

All participating individuals who donated blood signed informed consent. The identities of participants were encrypted using a third-party system approved and monitored by the Icelandic Data Protection Authority. The study was approved by the National Bioethics Committee of Iceland (approval no. VSN-15-023) following evaluation of the Icelandic Data Protection Authority. All data processing complies with the instructions of the Data Protection Authority (PV_2017060950ÞS).

RNA sequence data analysis was approved by the Icelandic Data Protection Authority and the National Bioethics Committee of Iceland (no. VSNb2015030021).

Danish data

Data were provided from the Danish Blood Donor Study (DBDS)61. The DBDS genetic study has been approved by the Danish National Committee on Health Research Ethics (NVK-1700407) and by the Danish Capital Region Data Protection Office (P-2019-99).

SNP and indel calling with GraphTyper

Before running GraphTyper, we preprocessed all input compressed reference-oriented alignment map (CRAM) index (CRAI) indices by extracting a large single file containing all CRAI index entries with sample ID for a 50-kb window (with 1-kb padding at each side of the region) for all samples. For each region, we then created a chopped CRAI for each sample by processing the large file for the corresponding region, substantially reducing the amount of CRAI index entries read.

Furthermore, we created a sequence cache of the reference FASTA file using the ‘seq_cache_populate.pl’ script distributed with samtools 1.9. In each region, we copied the corresponding sequence cache to the local disk and used it for reading the CRAM files by setting the ‘REF_CACHE’ environment variable.

We ran GraphTyper (v2.7.1) using the ‘genotype’ subcommand. The full command that we ran was in the format:

graphtyper genotype ${UKBIO_REFERENCE} –sams=${SAMS} –sams_index=${CRAI_TMP}/crai_filelist.txt –avg_cov_by_readlen=${COVERAGES} –region=${REGION} –threads=${THREADS} –verbose

Where UKBIO_REFERENCE is the GRCh38_full_analysis_set_plus_decoy_hla FASTA sequence file, SAMS is a list of all input BAM/CRAM files, CRAI_TMP is a path to the chopped CRAI files on the local disk, COVERAGES is the coverage divided by the read length for each input file, REGION is the genotyping region and THREADS is the number of threads to use.

SNP and indel calling with GATK is given in Supplementary Note 5. Detailed comparisons of GraphTyper and GATK call sets are provided in Supplementary Notes 6 and 7, Supplementary Figs. 9–12 and Supplementary Tables 18–21.

Running time

All jobs were run using 12 cores with 60 GB of reserved RAM. Approximately 1% of jobs were rerun using 24 cores with 120 GB reserved RAM. A few jobs requiring more cores and memory, with a single job finishing with 48 cores and 1,000 GB of RAM. Total reserved CPU time on cluster was 5.8 million CPU hours and total effective compute time of 5.0 million CPU hours. The difference in these numbers is explained by the fact that not all cores reserved for the program may not utilize all at the same time.

APPLY  Reaching Closer to Earth’s Core, One Lava Scoop at a Time

SV calling with Manta and GraphTyper

We ran a SV genotyping pipeline similar to the one that we had previously applied to 49,962 Icelandic individuals50. In summary, we ran Manta48 v1.6 to discover SVs on all 150,119 individuals in the genotyping set. We also created a set of highly confident common SVs (imputation information above 0.95, with frequency above 0.1%) from our previous studies using both Illumina short reads50 and Oxford Nanopore long-read data49. Finally, we inferred a set of SVs from six publicly available assembly datasets using dipcall62, as previously described50. We used svimmer50 to merge these different SV datasets and we called the resulting SVs using GraphTyper50 version 2.7.1. By incorporating data from long-read data and high-quality assemblies, we were able to call more true SVs than using short reads only, particularly for common SVs.

A total of 895,054 variants were called, of which 637,321 variants were annotated as “Pass”. Variant counts are presented for variants annotated by GraphTyper as “Pass”, unless otherwise noted.

The majority of the SVs are deletions (81.3%); however, we observed only slightly more deletions than insertions and duplications on average per individual (Fig. 4a). This is because the source for many insertions are long reads and assembly data, and thus many rare insertions are missing. Deletions are typically easier to discover in short-read data. Individuals who belong to the XAF cohort carry more SVs than in the other cohorts (Fig. 4b).

Imputation and phasing

The UKB samples were SNP chip genotyped with a custom-made Affymetrix chip, UK BiLEVE Axiom, in the first 50,000 individuals63, and the Affymetrix UKB Axiom array64 in the remaining participants. We used the existing long-range phasing of the SNP chip-genotyped samples5.

We excluced SNP and indel sequence variants in which at least 50% of the samples had no coverage (GQ score = 0), if the Hardy–Weinberg P value was less than 10−30 or if heterozygous excess was less than 0.5 or greater than 1.5.

We used the remaining sequence variants and the long-range-phased chip data to create a haplotype reference panel using in-house tools1,65. We then imputed the haplotype reference panel variants into the chip-genotyped samples using in-house tools and methods previously described1,65.

The imputation consists of estimating, for each haplotype, haplotype sharing with haplotypes in the haplotype reference panel, giving haplotype weights for each haplotype. These weights along with allele probabilities for each haplotype in the haplotype reference panel allow imputation with a Li and Stephens66 model similar to the one used in IMPUTE2 (ref. 67). Estimation of haplotype weights was based on long-range-phased chip haplotypes.

Sequence variant phasing consists of iteratively imputing the phase in each sequenced sample based on the other sequenced samples and the estimated phase from the last iteration. The imputed genotypes, along with the original genotypes, are weighted together to estimate new allele probabilites for the haplotypes. Imputation is done as described above.

We computed a leave-one-out r2 score (L1oR2) as the squared correlation (r2 value) of the original genotype calls, with the genotypes imputed for each sample when excluding the original genotype of the sample from the imputation input.

Batch effects from the sequencing centre were discovered in both raw genotype (Supplementary Table 21) and imputed data (Supplementary Table 22).

Identification of functionally important regions

To identify functionally important regions, we started by estimating whether reliable basecalls can be expected to be made at each site in the genome. The sequence coverage at each base pair in GRCh38 was computed for each of the 1,000 randomly selected individuals. At each base pair, we then computed the mean and s.d. of coverage across the 1,000 individuals. Base pairs with mean coverage of at least 20 and s.d. coverage of at most 12 were considered reliable base pairs. Only variants in GraphTyperHQ (AAscore > 0.5) were considered in the analysis.

Recurrent mutations and spectra under saturation

Using the classification of SNP variants from above, we calculated the ratio of all SNPs in GraphTyperHQ that falls into each category. Then, we did the same restricting to singletons, that is, calculate the proportion of singletons falling into each mutation class. For comparison, we calculated the fractions of each SNP class in all 181,258 SNPs from a curated list of 194,687 de novo mutations in 2,976 Icelandic trios20. We used this distribution on mutation classes to calculate the transition to tranversion ratio in each case.

To get a list of recurrent mutations, we joined this list of de novo mutations with GraphTyperHQ.

Saturation for general mutation classes

We restricted our analysis to the reliable base pairs described above and grouped base pairs and their complement and considered each A or T base in the genome as a mutation opportunity for T>A, T>C or T>G mutations. Similarly, we considered each G or C base as a potential C>A, C>G or C>T mutation, splitting C>T into two classes based on whether they occur in a CpG context. We then computed the saturation ratio as the number of observed mutations in GraphTyperHQ divided by the number of mutation opportunities at reliable base pairs. Computation was done separately for the autosomes and chromosome X. 95% CIs were computed using a normal approximation to the binomial distribution, treating each site as an independent observation.

APPLY  Strange new phase of matter created in quantum computer acts like it has two time dimensions

Sites methylated in the germline

We determined sites on GRCh38 that are methylated in the germ line using ENCODE whole-genome bisulfite sequencing9 data from samples of human testes and ovaries. More precisely, we used sample ENCFF946UQB and ENCFF157ZPP for testes and ENCFF561KYJ, ENCFF545XYI and ENCFF515OOQ for ovaries.

We assumed that methylation is strand symmetric and computed the methylation ratio for each CpG dinucleotide in a given tissue type by tabulating the number of reads supporting methylation or non-methylation in each dinucleotide, summing over all samples of a given tissue type and then computed the fraction of reads that support methylation.

We considered a site in a CpG dinucleotide on the reference genome methylated in the germ line if its methylation ratio was at least 0.7 in both testes and ovaries, and the combined depth was at least 20 for testes and 30 for ovaries, or 10 times the number of samples in each tissue type. This resulted in a list of 17,902,255 CpG (17,345,777 autosomal) dinucleotides, with 35,804,510 (34,691,554 autosomal) CpG>TpG mutation opportunities.

Saturation at methylated CpG sites

For each potential CpG>TpG at a methylated site, we assessed its most significant potential consequence with Variant Effect Predictor68 v. 100. In the case of multiple such consequences, we chose the alphabetically last one. We also classified them based on the functional classifications described above. For each class, we estimated the saturation as the ratio of variants of that functional class in GraphTyperHQ divided by the number of mutation opportunities. 95% CIs were computed using a normal approximation to the binomial distribution, treating each site as an independent observation.

Depletion rank

We followed a methodology akin to a previously published study27. A variant depletion score was computed for an overlapping set of 500-bp windows in the genome with a 50-bp step size. A total of 49,104,026 500-bp windows in which at least 450 bp were considered reliable base pairs were considered for further analysis. We tallied the number of occurrences of each possible heptamer (H) and the number of times the central base pair in the heptamer was observed as a SNP (S), across the first set of non-overlapping windows. To account for regional mutational patterns in the genome69, we dichotomized the genome into two mutually exclusive subsets, inside and outside C>G-enriched regions (Supplementary Table 12 in ref. 69). The ratio S:H was then interpreted as the expected mutation rate of the heptamer, separately for each of the two subsets. For each window, we then computed the observed number of variants (O) and then subtracted its expected number of variants (E), given its heptamers. This difference was divided by the square root of the expected value ((O−E)/√E). We exclued windows from the analysis in which the average AAscore was lower than 0.85 for variants within the window. These ((O−E)/√E) numbers were then sorted and the window with the i-th lowest depletion score was assigned a DR of 100(i−0.5)/n, where n is the total number of windows.

To compute DR restricted to the cohorts, we applied the same approach restricting to sequence variants that are present in each of the XBI, XSA and XAF cohorts.

Association testing

We tested for association with quantitative traits based on the linear mixed model implemented in BOLT-LMM70. We used BOLT-LMM to calculate leave-one-chromosome out residuals, which we then tested for association using simple linear regression. We used logistic regression to test for the association between sequence variants and binary traits. We tested variants for association under the additive model using the expected allele counts as a covariate for quantitative traits and integrating over the possible genotypes for binary traits. Sequencing status (whether the individual is one of the WGS individuals) and other available individual characteristics that correlated with the trait were also included in the model: sex, age and principal components (20 for XBI and XAF, 45 for XSA) to adjust for population stratification. Association analyses with XAF and XSA ethnicities have sample sizes of less than 10,000 and therefore were done with linear regression directly instead of BOLT-LMM. The correction factor used was the intercept of each regression analysis.

We used linkage disequilibrium (LD) score regression to account for distribution inflation in the dataset due to cryptic relatedness and population stratification71. Using 1.1 million variants, we regressed the χ2 statistics from our GWAS against the LD score and used the intercepts as a correction factor. Effect sizes based on the leave-one-chromosome out residuals were shrunk and we rescaled them based on the shrinkage of the 1.1 million variants used in the LD score regression. Supplementary Table 24 lists statistics for the GWAS analysis of each of the association signals presented here. Manhattan plots, quantile–quantile plots and histograms of inverse-normal-transformed values after adjustment for covariates age, sex and 40 principal components can be found in Supplementary Figs. 14 and 15 for quantitative and binary phenotypes, respectively. Locus plots for uric acid and menarche association can be found in Supplementary Fig. 16. OMIM32 and Open Targets72 annotations of the genes presented are provided in Supplementary Table 14.

APPLY  Look! Webb data reveals stunning images of potentially habitable star system

No statistical methods were used to predetermine sample size for association testing. All associations reported are for imputed genotypes. For comparison purposes, associations were also performed on the genotypes directly. For the association testing perfomed on the directly genotyped markers, the same set of covariates were used, apart from sequencing status (as all individuals were sequenced), and also the sequencing centre (deCODE, Sanger main, Sanger Vanguard) was used as a covariate. Supplementary Table 25 shows the correlation between the raw and the imputed genotypes and batch effects for sequencing centre in the XBI cohort.

An individual was deemed to be a carrier of an allele if the probability that the individual carried the allele was at least 0.9. The association analysis was limited to markers in which at least one (XAF, XSA), two (XBI, imputed dataset) or three (XBI, raw genotypes) individuals carried the minor allele. As association tests are frequently limited to a subset of the individuals in the dataset, the association analysis was further limited to those markers in which there was at least one carrier among the individuals in the association test. In the imputed dataset, association tests were further limited to those markers with imputation information > 0.5 and in the raw genotype set to those markers with sequencing information > 0.8 (ref. 1).

Defining cohorts

Most studies of UKB data to date have been conducted on a list of 409,554 ‘white British’ individuals created by the UKB on the basis of white British self-identification and clustering on genetic principal components derived from microarray genotypes5. Like some recent studies44,73,74, we wished to capitalize on the diversity in the UKB. To achieve this, we defined three cohorts based on the most common ancestries identified among the participants, using a combination of (1) uniform manifold approximation and projection (UMAP) dimension reduction of 40 genetic principal components provided by UKB, and (2) ADMIXTURE analysis supervised on five reference populations and self-reported ethnicity information.

To define the three cohorts, we followed previous work75 and applied UMAP to the 40 genetic principal components provided by the UKB. UMAP was performed in R using umap::umap() using default parameters in v0.2.3, notably, n_neighbours 15 and min_dist 0.1. UMAP placed the individuals in a two-dimensional latent space featuring several clusters and filaments. These structures showed a correspondence with self-described ethnicity (Supplementary Fig. 17).

To provide a separate measure of ancestry that we could use to inform our interpretation of the UMAP clusters, we superimposed results from a supervised ADMIXTURE58 analysis of the UKB microarray genotypes (Supplementary Section ADMIXTURE), using five training populations from the 1000 Genomes Project8: CEU (northern Europeans from Utah), CHB (Han Chinese in Beijing), ITU (Indian Telugu in the UK), PEL (Peruvians in Lima) and YRI (Yoruba in Ibadan, Nigeria). We observed a clear correspondence between UMAP coordinates and ancestry proportions assigned by ADMIXTURE (Supplementary Figs. 18 and 19). Using this correspondence and guided by self-reported ethnicity information, we defined the cohorts by manually delineating regions in the UMAP latent space that were limited to individuals with British–Irish ancestry (XBI; n = 431,805), South Asian ancestry (XSA; n = 9,633) and African ancestry (XAF; n = 9,252). This left 37,598 individuals with genotype data, who were assigned to an arbitrary cohort that we refer to as OTH (for other). The distribution of ancestry was estimated using ADMIXTURE in each of the four cohorts (Supplementary Fig. 18).

The most systematic difference between the XBI cohort and the prevailing UKB-defined white British set is our inclusion in XBI of around 12,500 individuals identifying as white Irish. This is clearly justified, given the known geographical and cultural proximity of the populations of Britain and the island of Ireland. More importantly, both our analyses (and those of previous publications) clearly reveal evidence for extensive gene flow between them. Thus, the main Irish genetic cluster appears in principal components analysis as an integrated component of continuous variation in the UK (Extended Data Fig. 2), and is not clearly separated from others. Another major difference of the XBI cohort relative to the much-used white British set, is the addition of around 10,900 individuals who did not identify as white British, but we infered to have ancestry indistinguishable from British–Irish individuals. We note that the greater size of the XBI cohort should provide more statistical power to detect genotype–phenotype associations. Cohort definitions are described in further detail in Supplementary Notes 16–22 and Supplementary Figs. 20–22.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this paper.

#sequences #genomes #Biobank #Nature

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button